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ANY  OPINIONS,  FINDINGS,  AND  CONCLUSIONS  OR  RECOMMENDATIONS 
EXPRESSED  IN  THIS  PUBLICATION  ARE  THOSE  OF  THE  AUTHOR  AND  DO 
NOT  NECESSARILY  REFLECT  THE  VIEWS  OF  THE  NATIONAL  SCIENCE 
FOUNDATION. 


THE  FINDINGS  IN  THIS  REPORT  ARE  NOT  TO  BE  CONSTRUED  AS  AN  OFFI¬ 
CIAL  DEPARTMENT  OF  THE  ARMY  POSITION,  UNLESS  SO  DESIGNATED  BY 
OTHER  AUTHORIZED  DOCUMENTS. 


Abstract 


We  discuss  methods  for  solving  the  unconstrained  optimization  problem  on  parallel  computers, 
when  the  number  of  variables  is  sufficiently  small  that  quasi-Newton  methods  can  be  used.  We  concen¬ 
trate  mainly,  but  not  exclusively,  on  problems  where  function  evaluation  is  expensive.  First  we  discuss 
ways  to  parallelize  both  the  function  evaluation  costs  and  the  linear  algebra  calculations  in  the  standard 
sequential  secant  method,  the  BFGS  method.  Then  we  discuss  new  methods  that  are  appropriate  when 
there  are  enough  processors  to  evaluate  the  function,  gradient,  and  part  but  not  all  of  the  Hessian  at  each 

■  I 

iteration.  WcUCvelup  new  algorithms  that  utilize  this  information  and  analyze  their  convergence  proper¬ 
ties  Wc  present  computational  experiments  showing  that  they  are  superior  to  parallelization  of  either  the 
BFGS  method  or  Newton's  method  under  our  assumptions  on  the  number  of  processors  and  cost  of  func¬ 
tion  evaluation.  Finally  wc  discuss  ways  to  effectively  utilize  the  gradient  values  at  unsuccessful  trial 
po::  -s  that  arc  available  in  our  parallel  methods  and  also  in  some  sequential  software  packages. 
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1.  Introduction 


This  paper  discusses  parallel  quasi-Newton  methods  for  solving  the  unconstrained  optimization  prob¬ 
lem, 

minimize  /  :R"  ->R  .  (j  ]) 

Our  main  emphasis  is  on  new  methods  that  effectively  utilize  multiple  processors  to  perform  multiple  func¬ 
tion  and  derivative  evaluations  simultaneously.  We  predominantly  use  these  multiple  function  evaluations 
to  calculate  or  approximate  derivative  values;  this  results  in  new  methods  that  have  different  derivative 
information  available  than  in  standard  sequential  algorithms.  Both  the  theoretical  properties  and  the  com¬ 
putational  performance  of  these  new  methods  arc  discussed.  In  addition,  we  consider  the  parallelization  of 
the  main  linear  algebra  costs  of  such  methods. 

The  unconstrained  optimization  problem  (1.1)  arises  in  many  applications  in  science,  engineering, 
and  other  areas,  and  is  often  very  expensive  to  solve.  Frequently  this  because  the  evaluation  of/(x)  itself 
is  expensive,  often  requiring  many  seconds  or  minutes  on  modem  computers.  Problems  with  expensive 
function  evaluations  are  our  main  concern  in  this  paper.  It  is  commonly  the  case  in  such  problems  that  ana¬ 
lytic  derivatives  are  not  available;  we  concern  ourselves  mainly,  but  not  exclusively  with  this  case. 

Due  to  the  expense  of  many  unconstrained  optimization  problems,  there  is  ample  incentive  for  trying 
to  solve  them  on  parallel  computers.  If  the  leading  expense  is  the  evaluation  of  /(x)  and  its  derivatives, 
then  one  possibility  is  simply  to  parallelize  each  of  these  evaluations.  The  effectiveness  of  this  approach 
depends  on  how  readily  a  parallel  routine  for /  (x)  (and  its  derivatives)  is  available,  and  how  fully  it  paral¬ 
lelizes  the  evaluation.  In  any  case,  this  approach  usually  is  outside  the  domain  of  the  optimization  algo¬ 
rithm  designer.  In  this  paper,  we  concentrate  on  the  opposing  case  when  the  evaluation  of  /(x)  is  assumed 
to  be  sequential,  and  parallelism  is  introduced  in  the  optimization  algorithm  itself.  This  approach  will  be 
appropriate  whenever  a  good  parallel  implementation  of  /(x)  is  not  available,  or  when  the  remaining  costs 
of  the  optimization  algorithm  (such  as  linear  algebra)  are  significant.  In  addition,  on  a  massively  parallel 
machine  our  approach  might  effectively  be  combined  with  parallel  evaluation  /(x)  in  a  multilevel  parallel 


Since  we  are  interested  in  performing  multiple  evaluations  of  an  arbitrary  function  /  (x ),  or  its 
derivatives,  concurrently,  our  parallel  methods  require  a  MIMD  computer.  This  is  a  computer  which  can 
perform  different  calculations  on  different  data  at  the  same  time.  By  contrast,  an  SIMD  computer,  which 
performs  the  same  calculation  on  different  data  at  the  same  time,  will  not  be  appropriate  in  general,  since 
each  evaluation  of  a  complex  function  will  in  general  require  a  different  sequence  of  arithmetic  operations. 

Almost  any  kind  of  MIMD  computer  is  likely  to  appropriate  for  the  algorithms  discussed  herein. 
This  includes  both  shared  memory  multiprocessors,  or  distributed  memory  multiprocessors  such  as  hyper¬ 
cubes  or  networks  of  computers.  The  reason  is  that  the  granularity  of  the  parallel  operations,  one  or  more 
evaluations  of  /  (x),  will  overwhelm  any  communication  or  synchronization  overhead  cost  once  f  (x) 
requires  even  a  moderate  number  of  floating  point  operations.  This  issue  is  discussed  in  more  detail  in  Sec¬ 
tion  2.2.  When  n  is  not  very  large,  the  parallelization  of  the  linear  algebra  that  we  discuss  may  be  more 
appropriate  for  shared-memory  multiprocessor  than  for  distributed  memory  multiprocessors;  this  is  dis¬ 
cussed  further  in  Section  2.3. 

The  methods  discussed  in  this  paper  are  all  in  the  general  class  of  quasi-Newton  methods.  These 
include  secant  methods,  and  finite  difference  Newton  methods.  On  sequential  computers,  secant  methods 
arc  generally  used  to  solve  (1.1)  when  function  evaluation  is  expensive,  the  analytic  Hessian  V2/(x)  is 
unavailable,  and  n  is  not  too  large.  They  use  an  approximation  to  the  Hessian  matrix  that  is  formed  from 
the  gradient  values  of  the  iterates,  and  require  n2  storage  and  0(n2)  arithmetic  operations  per  iteration  (sec 
e.g.  Fletcher  [1980],  Gill,  Murray,  Wright  [1981],  Dennis  &.  Schnabel  [1983]).  They  have  been  tradition¬ 
ally  used  for  problems  with  up  to  about  100  variables,  although  with  the  greater  storage  and  speed  of  paral¬ 
lel  computers,  they  may  become  useful  for  larger  dimensional  problems.  The  finite  difference  Newton's 
method  instead  forms  a  finite  difference  approximation  to  the  Hessian  from  function  or  gradient  values,  and 
requires  n2  storage  and  <?(n3)  arithmetic  operations  per  iteration.  It  is  generally  used  when  the  analytic 
Hessian  is  unavailable  and  function  evaluation  is  inexpensive,  for  problems  of  up  to  50  to  100  variables. 

The  remainder  of  this  paper  is  concerned  with  constructing  quasi-New  ton  methods  that  are  appropri¬ 
ate  for  parallel  computers.  In  Section  2  we  discuss  the  parallelization  of  the  standard  sequential  secant 
method  for  unconstrained  optimization,  the  BFGS  method.  This  topic  could  be  considered  somewhat 
unexciting  since  no  new  optimization  algorithm  is  invoiced.  But  it  lead*  to  effective  use  of  parallel 
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processors,  and  may  be  all  that  is  needed  in  many  situations.  Furthermore,  it  leads  to  the  consideration  of 
two  important  techniques.  The  first  is  the  speculative  evaluation  of  function  values  introduced  by  Schnabel 
[1987],  which  is  also  the  basis  for  the  new  optimization  algorithms  discussed  in  Sections  3  and  4.  The 
second  is  the  effective  parallelization  of  linear  algebra  calculations.  We  compare  various  methods  for 
organizing  these  calculations,  including  the  method  of  Han  [1986]  (derived  in  a  different  way)  and  discuss 
which  method  is  best  in  which  MIMD  environments.  In  this  section  we  also  summarize  the  results  of  some 
simple  experiments  with  our  parallel  BFGS  algorithm  on  a  Sequent  shared  memory  multiprocessor. 

The  main  contributions  of  this  paper  are  contained  in  Section  3.  There  we  further  develop  a  class  of 
new  methods,  introduced  in  Byrd,  Schnabel,  and  Shultz  [1987],  that  evaluate  the  function,  gradient,  and 
part  of  the  finite  difference  Hessian  at  each  iteration.  These  are  appropriate  in  two  situations,  both  of  prac¬ 
tical  interest:  when  the  function  and  analytic  gradient  arc  naturally  computed  together  on  one  processor  and 
the  number  of  processors,  p ,  is  between  2  and  n,  or  when  the  gradient  is  approximated  by  finite  differences 
and  2n  +  \<p<(n 2+3/1  )/2.  We  extend  the  development  of  the  methods  presented  in  Byrd,  Schnabel,  and 
Shultz  [1987],  and  present  both  convergence  analysis  and  computational  results  for  what  appears  to  the 
best  of  our  new  methods. 

The  methods  discussed  in  Section  3  fall  in  between  the  BFGS  method  and  a  finite  difference 
Newton’s  method.  An  important  aspect  of  these  methods  is  that,  as  we  explain  in  Section  3,  they  arc  not 
generally  expected  to  result  in  a  speedup  of  p  over  the  BFGS  method  on  p  processors.  This  implies  that  on 
a  sequential  computer,  the  new  methods  will  generally  be  inferior  to  the  BFGS  in  terms  of  total  function 
and  derivative  evaluations  required.  For  this  reason,  this  class  of  methods  has  apparently  not  been  con¬ 
sidered  prior  to  the  start  of  our  work  on  this  subject.  But  in  many  practical  situations  on  parallel  comput¬ 
ers,  the  new  methods  will  be  shown  to  be  superior,  in  terms  of  time  required,  to  either  the  parallelization  of 
the  BFGS  discussed  in  Section  2  or  a  similar  parallelization  of  a  finite  difference  Newton’s  method.  So 
these  new'  methods  are  relevant  as  long  as  overall  speed,  and  not  just  throughput  measured  in  problems 
solved  per  processor,  is  of  interest 

In  Section  4  wc  discuss  how  a  different,  more  minor  improvement  can  be  made  to  the  parallel  BFGS 
method  of  Section  2.  It  involves  utilizing  gradient  values  at  failed  trial  points,  which  arc  available  in  the 
parallel  algorithm,  to  reduce  the  total  number  of  iterations  required  by  the  algorithm.  Computational 


results  show  that  some  savings  are  possible.  In  some  sequential  codes,  this  gradient  information  is  also 
available  and  the  same  savings  are  possible.  Finally,  Section  5  summarizes  our  results  and  discusses 
interesting  directions  for  future  research. 

2.  Parallelizing  the  Standard  BFGS  Method 
2.1.  The  Sequential  BFGS  Method 

Perhaps  the  most  commonly  used  method  for  solving  multivariate  unconstrained  optimization  prob¬ 
lems  is  the  BFGS  method.  It  is  intended  for  problems  where  the  number  of  variables  is  small  enough  that 
the  cost  of  storing  annx/i  matrix,  and  performing  0(n2)  arithmetic  operations  per  iteration,  is  acceptable; 
otherwise  conjugate  direction  methods  (see  e.g.  Gill,  Murray,  and  Wright  [1981]  or  Dennis  and  Schnabel 
[1987])  arc  used.  Generally  the  largest  n  for  which  the  BFGS  method  is  applied  has  been  around  100,  but 
this  limit  may  rise  with  the  availability  of  faster  (sequential  or  parallel)  computers  with  larger  memories. 
The  BFGS  method  is  most  appropriate  when,  in  addition,  f  (x)  is  expensive  and  second  derivatives  are 
unavailable.  Otherwise  Newton’s  method  or  a  finite  difference  Newton’s  method  may  be  faster,  although 
the  BFGS  is  sull  often  used  in  practice. 

A  high  level  description  of  a  BFGS  algorithm  is  given  in  Algorithm  2.1 .  This  description  hides  many 
details  of  the  method,  for  example  the  calculations  in  the  line  search.  But  it  is  sufficient  to  indicate  the 
important  characteristics  and  costs  of  the  method,  which  in  tum  motivate  the  parallel  methods  discussed  in 
the  remainder  of  this  paper.  For  a  more  detailed  description  of  the  BFGS  algorithm,  sec  for  example 
Dennis  and  Schnabel  [1983], 

There  arc  two  main  categories  of  expense  in  the  BFGS  algorithm  :  function  and  derivative  evalua¬ 
tion,  and  linear  algebra  calculations.  The  function  evaluations  occur  in  the  line  search,  where  /  is 
evaluated  at  one  or  more  trial  points  xk+Xkdk  (with  different  values  of  Xk ),  culminating  in  a  successful 
point  that  becomes  z».i.  Computational  experience  has  shown  that  hardly  more  should  be  required  of  the 
successful  point  than  that  it  decrease  the  value  of  /  .  In  this  case,  the  first  trial  point  is  ofi  successful,  and 


rarely  arc  more  than  two  or  three  needed;  an  average  of  1.2  -  1.5  trial  points  per  iteration  is  typical  for 
many  problems.  Either  during  or  after  the  line  search,  the  gradient  at  the  successful  next  iterate  x*«.)  also  is 
calculated.  (Very  rarely,  a  gradient  value  may  be  calculated  at  an  unsuccessful  trial  point  during  the  line 
search.) 

Thus  each  iteration  of  the  BFGS  method  generally  consists  of  one  or  more  function  evaluations  fol¬ 
lowed  by  one  gradient  calculation  at  the  last  point  where  the  function  was  evaluated.  Often  when  /  (x)  is 
expensive  to  evaluate,  no  procedure  is  available  to  calculate  the  gradient  analytically.  In  this  case,  the  gra¬ 
dient  at  any  point  x  is  approximated  by  the  finite  difference  formula 

V/(F).  =  g,  =  /(X  +  M,^,)-/(f)  (2.1) 

where  c,  is  the  i'*  unit  vector  and  p,  usually  is  set  to  macheps 1/1  lx,  I.  This  approximation  requires  n 
evaluations  of  / (x )  in  addition  to / (x).  So  when  finite  difference  gradients  arc  used,  each  iteration  of  the 
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BFGS  method  usually  requires  n+1  or  n+2  evaluations  of  /  (jc). 

Now  we  return  to  the  linear  algebra  costs  of  the  BFGS  method.  There  are  two  main  linear  algebra 
calculations  in  Algorithm  2.1,  the  calculation  of  the  step  direction  d*,  and  the  calculation  of  the  new  Hes¬ 
sian  approximation  Bi+1.  The  calculation  of  B*+i  involves  a  rank  two  update  to  Bk  that  clearly  requires  a 
small  multiple  of  n  2  operations.  The  calculation  of  d*  appears  to  require  the  solution  of  a  system  of  linear 
equations,  and  hence  0  (n3)  operations,  at  each  iteration,  but  by  either  updating  a  factorization  of  Bt  or  by 
directly  updating  the  inverses  of  Bk,  this  cost  can  be  reduced  to  a  small  multiple  of  n2  operations.  These 
techniques  as  well  as  their  consequences  for  parallel  computation  are  discussed  in  Section  2.3.  It  will  be 
seen  that  the  entire  linear  algebra  cost  of  the  BFGS  method  can  be  limited  to  In 2  +  0(n)  multiplications, 
and  the  same  number  of  additions  and  subtractions,  per  iteration.  An  important  feature  of  the  BFGS 
method  is  that  each  Hessian  approximation  B*  is  symmetric  and  positive  definite,  so  that  each  direction  d* 
is  guaranteed  to  be  a  descent  direction. 

Since  the  linear  algebra  costs  of  the  BFGS  method  arc  so  small,  it  is  easy  for  function  and  derivative 
evaluation  to  be  the  dominant  cost.  For  example,  if  gradients  are  being  approximated  by  finite  differences 
and  if  each  function  evaluation  requires  at  least  20n  multiplications  and  additions,  then  function  and 
derivative  evaluation  will  account  for  at  least  90%  of  the  total  cost  of  the  method  on  a  sequential  computer. 
(We  are  disregarding  some  other  overhead  costs,  such  as  operating  system  costs,  but  for  even  moderate  n 
our  estimates  are  accurate.)  In  fact  many  real  problems  we  have  encountered  have  function  evaluations  that 
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arc  far  more  expensive  than  this.  Therefore  in  this  paper  we  concentrate  on  parallel  approaches  that  reduce 
the  cost  of  function  and  derivative  evaluation  by  calculating  multiple  function  or  derivative  values  con¬ 
currently.  Section  2.2  discusses  concurrent  function  and  derivative  evaluation  in  the  context  of  the  stan¬ 
dard  BFGS  method,  while  Sections  3  and  4  discuss  new  optimization  methods  that  utilize  concurrent  func¬ 
tion  and  derivative  evaluation. 

It  is  still  necessary  to  consider  parallelization  of  the  linear  algebra  calculations  in  the  BFGS  method, 
for  several  reasons.  First,  if  these  calculation  are  performed  sequentially,  they  may  become  a  bottleneck  on 
a  parallel  computer.  Second,  there  are  some  problems  where  n  is  rather  large  and  function  evaluation 
rather  cheap  so  that  the  linear  algebra  costs  may  be  significant  We  consider  the  parallelization  of  the 
linear  algebra  calculations  of  the  BFGS  method  in  Section  2.3.  While  we  don't  explicitly  consider  the 
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parallelization  of  the  linear  algebra  calculations  of  our  new  methods  of  Sections  3  and  4,  the  techniques 
discussed  in  Section  2.2  are  directly  applicable  to  these  new  methods  as  well. 


2.2  Concurrent  Function  Evaluation  in  the  Standard  BFGS  Method 

In  most  problems  where  f  (x)  is  expensive  to  evaluate,  the  gradient  is  not  available  analytically. 
Instead  it  is  calculated  by  the  finite  difference  approximation  (2.1).  We  restrict  ourselves  to  this  case  in  this 
section.  The  new  approaches  of  Section  3  will  be  seen  to  apply  both  to  problems  where  V/  ( x )  is  calcu¬ 
lated  analytically  and  where  it  is  approximated  by  finite  differences. 

The  most  obvious  source  of  parallelism  in  an  algorithm  that  uses  finite  difference  gradients  is  to  per¬ 
form  the  n  extra  evaluations  of / ( x )  required  by  (2.3)  concurrently.  Up  processors  are  available,  this 
requires  [  n  ip\  concurrent  function  evaluation  steps,  steps  where  each  processor  performs  at  most  one 
function  evaluation.  The  drawback  to  this  approach  is  that  during  the  evaluation  of  f  (x)  in  the  line  search, 
the  remaining  p- 1  processors  are  idle.  If  p«n,  this  is  unimportant  since  each  finite  difference  gradient 
requires  many  concurrent  function  evaluation  steps  while  each  function  evaluation  requires  just  one  or  two, 
so  this  simple  approach  gives  good  speedups  for  expensive  /.  If  p=n,  however,  then  the  maximum 
speedup  that  can  be  obtained  on  problems  with  expensive  function  evaluation  from  parallelizing  only  the 
finite  difference  gradient  calculation  is  about  nl 2,  or  at  most  half  of  optimal.  This  is  because  both  function 
and  gradient  evaluations  require  one  concurrent  function  evaluation  step,  n-1  processors  are  idle  during 
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each  function  evaluation,  and  there  arc  at  least  as  many  function  evaluations  as  gradient  evaluations.  A 
more  precise  analysts  is  given  below.  Up  >n  ,p-n  processors  are  not  utilized  by  this  approach. 

An  improvement  on  the  above  strategy  was  suggested  by  Schnabel  [1987],  It  simply  is,  while  one 
processor  is  evaluating  / (xk+Xkdk)  during  the  line  search,  to  utilize  the  remaining  p- 1  processors  to 
evaluate  max(p-l/i )  components  of  V/ (xk ->-Xi dk).  We  refer  to  this  as  a  speculative  evaluation  of  (part 
of)  the  finite  difference  gradient.  If  xk+\kdk  is  accepted  as  the  next  iterate,  as  it  is  most  of  the  time,  then 
this  gradient  information  is  required  by  Algorithm  2.1  and  only  n+l-p  function  evaluations  remain  for  the 
finite  difference  gradient,  none  U p>n  +  \.  If  x*+/.*d*  is  not  accepted,  then  this  gradient  information  is  not 
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used  by  Algorithm  2.1,  but  nothing  has  been  lost  in  comparison  to  the  approach  described  in  the  previous 
paragraph.  Furthermore  we  show  in  Section  4  how  to  make  some  good  use  of  the  gradient  information  ai 
failed  trial  points  by  changing  the  optimization  algorithm. 

If  the  average  number  of  trial  points  per  iteration  in  the  line  search  is  t ,  then  the  strategy  of  con¬ 
current,  speculative  finite  difference  gradient  evaluation  requires 

~  +  5  (2.2) 

concurrent  function  evaluation  steps  per  iteration,  as  opposed  to  n  +  l-t-5  steps  for  the  sequential  method, 
and 

+  1+6  (2.3) 

for  the  first  parallel  method  that  parallelizes  the  finite  difference  gradient  evaluation  only.  Thus  when  func¬ 
tion  evaluation  is  the  dominant  cost,  the  new  method  will  make  nearly  optimal  utilization  of  n  +  1  or  fewer 
processors  as  long  as  8  «  1.  (Recall  that  this  is  usually  the  case  in  practice.)  The  main  cases  which  are  not 
addressed  satisfactorily  by  this  approach  are  situations  when  p  is  greater  than  n  +  1,  or  w  hen  the  gradient  is 
calculated  analytically.  These  are  addressed  in  Section  3. 

We  have  run  experiments  on  a  Sequent  shared  memory'  multiprocessor  to  show'  that  the  speedups 
predicated  by  the  above  discussion  arc  achieved  in  practice.  We  compared  a  parallel  BFGS  method  utiliz¬ 
ing  speculative,  concurrent  finite  difference  gradient  evaluations  and  the  parallel  linear  algebra  discussed  in 
Section  2.3  to  a  sequential  BFGS  algorithm.  We  chose  4  standard  test  problems  with  n=40,  the  extended 
versions  of  Rosenbrock’s  function,  Powell’s  singular  function,  Broydcn’s  tridiagonal  function,  and  the 
variably  dimensioned  function  (see  Mord,  Garbow,  and  Hilistrom  [1981]),  with  one  modification  :  we 
introduced  a  meaningless  loop  into  each  function  evaluation  so  that  the  total  cost  of  the  function  evaluation 
would  be  about  20n  flops,  meaning  that  function  evaluation  would  account  for  about  90%  of  the  cost  of  the 
entire  optimization  algorithm.  On  the  6  processors  available  to  us,  the  timed  speedups  ranged  from  5.7  to 
6.0.  These  numbers  were  in  close  agreement  to  those  predicted  by  equation  (2.2),  and  underscore  the  point 
that  if  p  <n ,  function  evaluation  is  expensive,  and  finite  difference  gradients  are  used,  then  it  is  easy  to 
parallelize  the  BFGS  algorithm  almost  fully. 


Finally,  we  note  a  different,  related  approach  that  has  been  suggested  by  several  authors  (Dixon 
[1981],  Dixon  and  Patel  [1982],  Patel  [1982],  Lootsma  [1984],  van  Laarhovcn  [1985])  for  utilizing  multi¬ 
ple  processors  during  the  line  search  in  the  BFGS  (or  Newton’s)  method.  It  is  to  utilize  the  additional  p- 1 
processors  that  are  available  while  / (xk+Xtdk)  is  being  evaluated  to  evaluate  f  (x)  at  other  trial  points,  in 
the  direction  dk  from  xk  and  perhaps  in  other  directions  as  well.  As  opposed  to  the  strategies  discussed 
above,  this  strategy  changes  the  optimization  algorithm  and,  hopefully,  sometimes  results  in  a  better  next 
iterate  and  thus  a  smaller  total  number  of  iterations  being  needed  to  solve  the  optimization  problem. 

An  interesting  question  is  whether  this  approach  is  superior  to  the  approach  discussed  above,  namely 
using  the  extra  processors  to  perform  a  speculative  evaluation  of  part  of  the  gradient  during  the  line  search. 
Note  that  the  cost  per  iteration  of  the  "extra  line  search  points”  (ELSP)  approach,  assuming  that  finite  dif¬ 
ferent  gradients  are  evaluated  concurrently,  is  given  by  (2.3).  Thus  from  (2.2),  we  see  for  example  that  if 
p  >n  ~  1  and  if  5  =  8 bfcs  -  0  for  the  BFGS  method  and  5=0  (the  best  case)  for  the  ELSP  approach,  then  the 
ELSP  approach  is  superior  to  the  speculative  gradient  method  if  and  only  if  it  requires  no  more  than 
( l-5src.O  ~  Limes  as  many  iterations  as  the  BFGS  method.  Thus,  if  hBFcs  is  close  to  0,  the  ELSP  method 
would  have  to  reduce  the  iteration  count  of  the  BFGS  by  almost  50%  to  be  superior  to  it.  We  doubt  that  this 
reduction  is  likely  in  general,  but  would  be  interested  in  computational  results  that  address  this  issue.  We 
note  finally  that  if  a  method  using  the  ELSP  approach  could  reduce  the  iteration  count  of  the  BFGS  by  a 
factor  of  2  by  always  considering  n  points  in  the  line  search,  then  this  would  in  fact  be  a  better  sequential 
algorithm  than  the  standard  BFGS  as  well. 

2.3  Parallelizing  the  Linear  Algebra  Calculations  in  the  BFGS  Method 

Aside  from  function  and  derivative  evaluations,  the  dominant  costs  in  the  BFGS  method  are  the  rank 
two  update  of  Bk  and  the  calculation  of  the  search  direction  dk  that  are  performed  at  each  iteration.  As 
mentioned  before,  these  require  at  least  0  (n2)  arithmetic  operations.  All  the  other  calculations  in  the  algo¬ 
rithm  require  at  most  0  ( n )  operations. 
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Ii  is  most  convenient  to  think  of  the  update  and  search  direction  calculation  as  being  a  pair  performed 


in  that  order,  i.e.  update  B*  to  B*.,i,  then  calculate  dk+\.  There  are  several  different  ways  to  organize  these 


calculations.  First,  either  the  sequence  of  matrices  [B* )  or  the  sequence  of  inverses  of  these  matrices 


(B*'1)  may  be  kept.  Sequencing  Bk  1  is  reasonable  because,  from  the  Sherman-Morrison- Woodbury  for¬ 


mula,  if  Bt+i  is  a  rank  two  update  of  Bk ,  then  Bk~},  is  a  rank  two  update  of  Bfl.  An  advantage  of  sequenc¬ 


ing  the  inverses  is  that  the  calculation  of  the  search  directions  dk  becomes  simple  and  cheap. 


In  addition,  no  matter  whether  Bk  or  its  inverse  is  kept,  the  approximation  can  be  kept  either  as  the 


symmetric  and  positive  definite  matrix  B*  or  B*"1,  or  as  a  factorization  of  this  matrix.  If  the  factorization  is 


kept  then  it  can  be  updated  directly  into  the  factorization  of  the  next  approximation.  The  general  approach 


of  updating  factorizations  was  introduced  by  Gill,  Golub,  Murray,  and  Saunders  [1974],  while  the  special 


form  used  for  the  BFGS  was  introduced  by  Goldfarb  [1976]. 


These  approaches  to  the  linear  algebra  calculations  of  the  BFGS  method  are  summarized  in  Table 


2.1.  For  each  approach.  Table  2.1  shows  the  basic  operations  that  are  involved,  and  their  cost  in  multipli¬ 


cations.  (The  number  of  additions  and  subtractions  is  the  same  as  the  number  of  multiplications,  or  nearly 


so,  in  each  case.)  The  upper-left  variant  is  the  most  straightforward  and  includes  a  Cholesky  factorization 


at  each  iteration;  it  is  the  only  variant  that  requires  0(n3)  operations.  The  upper-right  variant  is  the 


sequencing  of  Cholesky  factorizations  as  derived  by  Goldfarb  [1976].  It  involves  a  rank  one  update  to  the 


Cholesky  factor  Lk  of  Bk  followed  by  a  sequence  of  Given’s  rotations  that  reduce  this  updated  matrix  Jk+- 


to  a  new  lower  triangular  matrix  Lk. i  that  is  the  Cholesky  factor  of  Bk+,  (see  Dennis  and  Schnabel  [198?] 


for  detail-. >.  A  straightforward  implementation  requires  6 n2  operations  but  Goldfarb  showed  that  this  can 


be  reduced  to  2.5 n2  by  storing  some  additional  vectors. 


The  lower-left  variant  results  from  the  application  of  the  inverse  form  of  the  BFGS  update, 


d-i  d-i  ,  ’>*)*/  +  4*  (Si-Bt  ]yk)T  (sk-Bk]yk)Tyk  sksf 

B‘ •' =Bt  * - jrs - arc? — 


followed  by  the  multiplication  of  BfJ i  by  g**i  to  calculate  dk* j.  If  the  calculations  are  organized  as  fol- 


t  =  BkA  gk*\ 


z  =  Sk  -  I  +  dk 


-VV 


a 


i 
\ 

aa 


Table  2.1  —  Four  Possible  Implementations  of  the  Linear  Algebra  Calculations : 

Bt*\  =  Bk  +  rank -two- matrix 
solve  Bk+ 1  dt+i  =  -gk+ 1  for  dk*i 


Matrix  Stored  Unfactored 


Matrix  Stored  Factored 


Y=  Sk  Tyk  ,  5  =  zTyk 

z  =  z  +  ^  J‘  (2-5) 

Br+i  =  £*''  +  z  s[  +  skzT 

y=s[ gk+ 1  ,  8  =  z7g*+i 

<f*+i  =  i  +  yz  +8  Sk 

then  only  one  matrix  vector  multiplication,  and  a  rank-two  update  of  a  symmetric  matrix,  are  required,  each 
needing  n2  multiplications  as  long  as  only  the  lower  (or  upper)  triangle  of  each  Bf]  is  stored. 

The  lower-right  variant  is  to  keep  a  factorization  Af*  A//  of  Br and  update  Af*  by  the  rank-one  for¬ 
mula  for  the  BFGS  update  of  the  factorization  of  the  inverse  to  the  Mk+\  for  which  Af*»,A fL\  =  Bk~j ; .  In 


this  case  there  is  no  advantage  in  keeping  the  factors  triangular  since  the  cost  of  doing  this  would  outweigh 
the  advantage  in  calculating  dt* This  implementation  of  the  BFGS  has  received  less  attention  than  the 
others,  although  it  has  been  discussed  by  several  authors  including  Brodlie,  Gourlay,  and  Greenstadt 
[1973],  Davidon  [1975],  and  Powell  [1987]  Recently  Han  [1986]  derived  the  same  implementation  of  the 
BFGS  linear  algebra  from  a  rather  different  viewpoint. 

In  exact  arithmetic,  these  four  variants  of  the  BFGS  method  produce  identical  iterates,  and  differ 
only  in  the  number  of  operations  required.  In  finite  precision  arithmetic,  however,  they  may  produce  dif¬ 
ferent  iterates.  Optimization  folklore  has  long  held  that  the  unfactored  inverse  update  may  be  less  stable 
than  the  factored  direct  update.  Since  the  inverse  updates  appears  more  attractive  for  parallel  computation 
(see  below we  decided  to  test  this  belief  experimentally.  W’c  inserted  each  of  the  four  variants  of  the 
BFGS  update  described  in  Table  2.1  into  the  line  search  BFGS  method  in  the  UNCMIN  package  of  Schna¬ 
bel,  Koontz,  and  Weiss  [1985],  and  tested  each  on  the  test  set  of  More,  Garbow.  and  Hillstrom  [1981],  The 
differences  in  performance  were  negligible,  averaging  no  more  than  1-2%  overall  with  little  variation  on 
specific  problems.  J.  Noccdal  [1987,  private  communication]  has  obtained  similar  results  on  a  broader  set 
of  test  problems  that  included  some  specifically  designed  to  give  the  inverse  variant  difficulties.  L.  Grandi- 
nctu  [1978]  reports  similar  results. 

Thus  we  consider  any  of  the  variants  in  Table  2.1  as  valid  points  of  departure  for  the  construction  of 
parallel  BFGS  methods.  It  is  possible  that  the  difficulty  with  the  inverse  updates  may  be  greater  for  the 
DFP  update,  where  there  may  be  a  larger  tendency  to  produce  numerically  indefinite  inverse  approxima¬ 
tions,  and  that  this  may  have  been  the  basis  of  the  folklore  about  inverse  updates  that  was  then  extended  to 
include  the  BFGS.  This  possibility  was  pointed  out  to  us  by  J.  More  [1987]. 


Now  we  consider  the  implementation  of  the  linear  algebra  of  the  BFGS  method  on  parallel  comput¬ 
ers.  The  unfactored  direct  method  remains  least  attractive  alternative  on  parallel  computers  because  of  its 
high  operation  count,  coupled  with  the  fact  that  we  will  sec  that  some  of  the  cheaper  methods  parallelize 
excellently.  The  factored  direct  method  also  appears  to  be  less  attractive  than  the  two  inverse  methods. 
This  is  because  any  straightforward  implementation  of  this  approach  requires  a  sequence  of  O(n)  vector- 
vector  operations,  such  as  Given’s  rotations.  This  leads  to  a  considerably  higher  amount  of  synchroniza¬ 
tion  and  communication  than  in  the  inverse  methods,  and  also  docs  not  lead  direaU  to  matrix-vector 
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operations,  which  often  lead  to  more  efficient  utilization  of  parallel  computers. 

On  the  other  hand,  both  of  the  inverse  approaches  seem  to  lend  themselves  excellently  to  implemen¬ 
tation  on  cither  shared  or  loc  J  memory  multiprocessors.  Both  consist  only  of  matrix-vector  multiplications 
and  rank-one  updates,  which  parallelize  fully  and  can  be  implemented  as  block  operations.  On  a  shared 
memory  multiprocessor  with  p<n  processors,  we  would  expect  the  unfactored  direct  approach  to  require 
time  proportional  to  2 n2lp ,  and  the  factored  inverse  approach  to  require  time  proportional  to  4n2//> .  Other 
considerations,  such  as  caching,  seem  similar  for  the  two  approaches.  It  is  possible  that  the  rank  one 
update  of  a  triangular  matrix,  required  by  the  unfactored  inverse  approach,  would  not  parallelize  quite  as 
well  as  the  other  operations  in  conjunction  with  some  caching  policies. 

On  a  local  memory  multiprocessor,  it  appears  that,  in  order  to  avoid  excessive  communication,  the 
un factored  inverse  approach  would  need  to  store  and  update  the  full  matrix  fig1  (parutioncd  by  rows) 
rather  than  just  the  upper  or  lower  triangle.  This  raises  the  total  cost  of  the  method  to  3n2  operations  which 
narrows  the  gap  between  it  and  the  factored  inverse  approach.  Again  the  arithmetic  operauons  should 
parallcli/c  fully  for  both  approaches.  In  addition,  both  approaches  appear  to  require  the  same  amount  of 
information  to  be  communicated  per  iteration,  although  the  factored  method  seems  to  only  require  one  syn¬ 
chronization  point  whereas  the  unfactorcd  method  seems  to  require  two. 

From  the  above  discussion,  we  would  expect  the  unfactored  inverse  approach  to  be  the  best  way  to 
implement  the  linear  algebra  operauons  of  the  BFGS  method  on  a  shared  memory  multiprocessor.  It  w  ould 
also  appear  to  be  the  best  approach  for  a  local  memory  multiprocessor,  but  it  should  be  tested  against  the 
factored  inverse  approach.  On  a  d.amd  memory  muluproccssor,  the  synchronization  costs  are  small  and 
the  parallel  BFGS  should  be  efficient  for  almost  any  value.-,  ^.f  *  and  p.  For  the  parallelization  of  the 
BFGS  to  be  efficient  01.  a  local  memory  multiprocessor,  the  number  of  floating  point  operauon.  per  prrv-es- 
sor  per  iteration,  about  3 n2/'p ,  must  significantly  exceed  the  cost  of  sending  either  one  or  two  messages  that 
contain  a  total  of  about  3 n  floating  point  numbers. 

The  parallel  BFGS  code  menuoned  at  the  end  of  Sccuon  2.2  uses  a  parallel  version  of  the  unfactored 
inverse  approach.  To  lest  how  well  all  the  linear  algebra  calculations  arc  parallelized,  we  ran  this  code  on 
a  Sequent  shared  memory  multiprocessor  on  the  cheapest  possible  objective  function,  / (x  )  =  x7x.  Thus 
the  linear  algebra  calculauons  are  the  dominant  cost.  We  also  parallelized  most  of  the  0  m  t  computations. 
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although  inner  products  were  left  sequential.  We  found  that  the  speedup  on  6  processors  was  only  about 
3.7  for  n  =  40,  and  4.3  for  n  =  100.  These  results  plus  our  results  using  fewer  processors  indicated  that 
approximately  12%  of  the  code  remained  sequential  for  n  =  40,  while  approximately  8%  remained  sequen¬ 
tial  when  n  =  100.  This  indicates  the  importance  of  parallelizing  all  the  O(n)  calculations,  as  well  as  the 
O  (n2)  calculations,  in  a  parallel  implementation  of  the  BFGS  method. 


3.  Parallel  Methods  That  Use  Part  Of  The  Finite  Difference  Hessian 
3.1  Approaches  to  Using  Partial  Hessian  Information 

Wc  now  consider  a  class  of  methods  that  use  parallel  processors  to  evaluate  part,  but  not  all,  of  the 
finite  difference  Hessian  matrix  V2/  (x)  along  with  the  function  and  gradient  at  each  trial  point.  Our  orien¬ 
tation  is  towards  problems  where  function  and  derivative  evaluation  is  the  dominant  cost.  As  discussed 
previously,  this  is  the  case  for  many  practical  problems. 

The  approaches  that  we  discuss  fall  in  between  the  BFGS  method,  which  uses  only  the  function  and 
gradient  at  each  trial  point,  and  Newton’s  method,  which  uses  the  function,  gradient,  and  Hessian.  Implicit 
in  this  statement  arc  two  assumptions.  First,  that  if  we  have  enough  processors  to  evaluate  the  function, 
gradient,  and  Hessian  in  one  concurrent  function  evaluation  step,  then  we  will  do  this  and  use  a  modem 
New  ton’s  method  based  algorithm  (sec  e.g.  Mord  and  Sorensen  [1983]).  Second,  that  if  wc  do  not  have 
enough  processors  to  do  this,  then  we  will  probably  not  want  to  use  extra  concurrent  function  evaluation 
steps  to  evaluate  the  full  Hessian  at  each  iteration.  This  second  assumption  is  motivated  by  considerable 
computational  experience  (sec  e.g.  Schnabel,  Koontz,  and  Weiss  [1985])  that  shows  that  the  iterations 
saved  by  using  a  finite  difference  Newton's  method  algorithm  rather  than  the  BFGS  method  usually  do  not 
offset  the  extra  cost  per  iteration  in  function  evaluations.  The  results  of  Section  3.3  will  validate  this 
assumption. 

Thus  wc  consider  the  approach  of  partial  Hessian  evaluation  whenever  there  arc  not  enough  proces¬ 
sor  to  evaluate  the  function,  gradient,  and  Hessian  in  one  concurrent  function  evaluation  step,  but  more 
titan  enough  to  c\aluaie  just  the  function  and  gradient.  This  occurs  in  two  disunct  situations,  both  of 
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practical  interest.  The  first  is  when  the  gradient  is  evaluated  by  finite  differences  and  the  number  of  proces¬ 
sors  is  greater  than  n  +  1  but  less  than  (n2+3n+2)/2.  In  this  case,  there  are  more  than  enough  processors  to 
evaluate  the  function  and  finite  difference  gradient  concurrently  at  each  trial  point,  but  not  enough  to  evalu¬ 
ate  the  function,  finite  difference  gradient,  and  full  finite  difference  Hessian.  For  example,  on  a  64  node 
hypercube,  this  is  the  case  whenever  n  e  [10,63].  The  second  scenario  we  consider  is  when  the  analytic 
gradient  is  readily  computed  along  with  the  function  value,  so  that  it  is  most  convenient  to  computer  both 
on  one  processor,  but  the  analytic  Hessian  is  not  available.  This  is  the  case  in  a  reasonable  number  of  prac¬ 
tical  problems,  for  instance  many  optimal  control  problems.  In  this  case,  if  the  number  of  processors  is 
between  2  and  n ,  we  again  have  more  processors  than  are  needed  for  just  the  function  and  gradient,  but  not 
enough  for  the  full  finite  difference  Hessian  (which  requires  n  additional  gradient  values)  as  well. 

In  cither  of  these  cases,  the  methods  of  this  section  use  the  excess  processors  to  compute  as  large  a 
portion  of  the  finite  difference  Hessian  as  possible  at  each  iteration.  An  interesting  aspect  of  these  algo¬ 
rithms  is  that  while  they  will  be  seen  to  be  worthwhile  on  parallel  computers  whenever  the  partial  Hessian 
evaluation  uses  otherwise  unutilized  processors,  or  if  the  goal  is  absolute  speed  (rather  than  speed  per  pro¬ 
cessor),  they  are  not  in  general  the  most  efficient  methods  on  sequential  computers.  Probably  for  this  rea¬ 
son,  they  have  apparently  not  been  considered  prior  to  our  investigations. 

Byrd,  Schnabel,  and  Shultz  [1987]  proposed  a  variety  of  approaches  for  utilizing  partial  Hessian 
information,  and  examined  some  of  their  computational  and  theoretical  properties.  The  general  approach 
that  they  found  to  be  best  is  outlined  in  Algorithm  3.1.  The  remainder  of  Section  3.1  continues  the 
development  of  this  approach.  In  Sections  3.2  and  3.3  we  present  new  theoretical  and  computational 
results  about  this  type  of  method. 

Algorithm  3.1  differs  from  the  standard  BFGS  method,  Algorithm  2.1,  in  several  ways.  First,  the 
speculative  gradient  evaluation  discussed  in  Section  2.2  is  performed  at  each  trial  point  in  the  line  search. 
Second,  speculative  evaluation  of  some  portion  of  the  Hessian  also  is  performed  at  each  trial  point  in  the 
line  search.  Third,  this  partial  Hessian  information  is  incorporated  into  the  Hessian  approximation  at  each 
iteration,  following  the  standard  BFGS  update.  We  now  briefly  discuss  the  motivation  for  these  steps  and 
some  of  the  alternatives  considered  in  Byrd,  Schnabel,  and  Shultz  [1987].  Wc  also  introduce  some  new 
aspects  of  these  steps. 


Algorithm  3.1  --  Quasi-Newton  Method  Tor  Unconstrained  Optimization 
Using  Speculative  Partial  Hessian  Evaluation 

Given  xo  .  / ( *o )  .  go  =  V/(x o)  (or  finite  difference  approximation),  B0e  R'XA  positive  definite  (e  g 
B0  =  I),  q  €  [l.n-1] 

At  iteration  k  : 

(  calculate  search  direction  } 

solve  =~gk  for  dk  (  dk  is  search  direction  ) 

{  line  search  ) 

choose  set  of  q  linearly  independent  vectors  ui,  •  ■  • ,  uq 
repeat 

choose  value  of  stcplcngth  X* 

evaluate  /  (x*  +  X*d»),  V/(x*  -t  -\kdk)  (or  finite  difference  approximation),  and  finite 
difference  approximation  to  V2/  (xk  +  Xkdk )  u,  for  each  i  e  1 1 ,  q  ] 
until  x*  +  Xkdk  is  satisfactory  next  iterate 

x*.;  :=  x*  +  Xkdk 


decide  whether  to  stop  ;  if  not : 

(  update  Hessian  approximauon  ) 

st  :=x**i  -xk  ,  »  -  gt 

B,..  :=  Bk  -  +  2i>I  (  BFCS  update  ) 

S  DkSk  5 k  )k 

Bk :=  update  of  Bk.\  based  on  the  finite  difference  information  T2/ (xkt!j  u, .  t  =  l,  ■  •  • ,  q 


The  partial  Hessian  informauon  that  is  approximated  in  Algorithm  3.1  is  V2/  (x)  u,,  i=  1,  ■  •  ,  q . 
Byrd,  Schnabel,  and  Shultz  considered  two  choices  of  the  vectors  u ,  that  arc  selected  at  each  iteration  :  a 
set  of  q  unit  directions,  or  a  set  of  q  conjugate  directions  They  found  that  using  conjugate  directions  led 
to  no  significant  advantage  in  the  context  of  Algorithm  3.1,  and  that  it  caused  a  considerable  extra  linear 
algebra  cost.  Therefore,  we  only  consider  the  use  of  unit  directions  below.  That  is,  at  each  iteration  we 
select  [a.  ]  by 

choose  a  set  T*  of  distinct  integers  between  1  and  n 

u,  =er,  where  Y,  is  the  i member  of  r»  (3.1) 

This  means  that  our  algorithm  approximates  q  columns  of  V2/  (x ),  whose  indices  are  given  by  T* .  at  caJi 

iteration.  In  our  computational  implementation,  we  choose  the  sequence  of  sots  r.  to  cycle  through  the 
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indices  1  to  n . 

If  the  function  and  gradient  are  evaluated  analytically  together  on  one  processor,  then  column  i  of 
the  Hessian  at  x  =  x can  be  approximated  by  calculating  V/fx+p.e,)  where  p,  = 
mac  heps 1 7 1 1,  I ,  and  then  setting 


V2/(x)c,  =  h,  =  V/<x+M«  g. L  (x  >  (3.2) 

Thus  if  p€  [2,  n]  processors  are  available,  then  q  will  be  set  \o  p~\  and  q  columns  of  the  Hessian  will  be 
evaluated  using  q  additional  gradient  evaluations. 

If  the  gradient  is  not  available  analytically,  then  the  only  way  to  approximate  the  gradient  or  Hessian 
is  from  finite  differences  involving  function  values.  Let  T*  =  {j  I  je  (1,2.  •  •  si ),  vs'I'*).  A  new, 
efficient  way  to  approximate  the  gradient  and  q  columns  of  the  Hessian  at  x  is  to  use  the  formulas 


V2ft7\  =  th  \  -  /(*  +lLe.  +a,e,)-f(x  ■>  P,f,)-/(x  +  aJe1)  +  /(x) 
'  J  \x  “  \nj  h  “  i ,  rt 


V*  Cty 

Vf(x),  =  h,  =  f(x 

for  i  e  F* ,  ./€  T* ,  where  p,  =  macheps' 7  lx,  I  and  a,  =  macheps 14 lx,  I , 

V2y(j)  ^  ih  )(  =  /(X  -t-p,e,  +p;f,,)-/(x  -f  P,e,)-/(x  -t-p/f/)4/(J) 

iy  1  1  P7P; 

V2/  (x)„  =  (A.).  =  + 


(3.3a) 

(3.3b) 

(3.3c) 

(3.3d) 


for  i ,  ;  €  f* ,  i  sj  ,  where  [3,  =  macheps 1 3 1 x, 


V/(D,  =  h,  =  P,fl)  (3.3c) 

for  ieT*,  with  the  same  (3,.  Using  these  formulas,  we  can  approximate  the  function,  gradient  and  q 
columns  of  the  Hessian  using  (n+l-(^))  (<?  +  ])  function  evaluations.  Thus  if  p  processors  arc  available. 


we  will  choose  q  to  the  the  largest  integer  for  which  (n +l-(^  1)  f^  +  1)  <p.  A  side  benefit  of  the  above 

formulas  is  that  for  each  ieT*,  the  irt  component  of  the  gradient  is  approximated  by  central  differences 
and  hence  is  more  accurate  than  the  value  given  by  the  standard  forward  difference  approximation  (2.1 ),  at 
no  additional  cost  in  function  evaluations. 
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The  partial  Hessian  information  is  incorporated  into  the  new  Hessian  approximation  after  the  new 
iterate  x*»i  is  selected.  At  this  point  Algorithm  3.1  potentially  has  q  +  l  new  pieces  of  information  to  incor¬ 
porate  into  the  Hessian  approximation  :  the  standard  secant  equation 

st  =  ,  (3.4) 

and  the  q  finite  difference  values 

i  u ,  =  r,  ,  q  (3.5) 

where  u ,  =  eT  and  z,  is  the  finite  difference  approximation  to  column  y,  of  V2/  (jc*+i)-  We  incorporate  the 

standard  secant  equation  (3.4)  first,  and  then  the  finite  difference  information  (3.5).  This  order  seems  rea¬ 
sonable  because  the  standard  secant  equation  gives,  in  some  sense,  information  about  the  Hessian  value  in 
between  jr*  and  x*+i,  while  the  finite  difference  information  is  at  xk+\  and  hence  is  the  most  current  infor¬ 
mation.  Updating  in  this  order  means  that  the  standard  secant  equation  may  not  hold  at  the  ultimate  value 
of  5*-:,  but  in  Section  3.2  we  show  that  q-supcrlincar  convergence  still  is  retained.  Byrd,  Schnabel,  and 
Shultz  [1987]  also  considered  omitting,  or  only  temporarily  using,  the  standard  secant  equation  (3.4),  but 
their  computational  results  indicated  that  it  is  preferable  to  include  it.  This  is  the  only  possibility  con¬ 
sidered  in  this  paper. 


First  wc  incorporate  the  standard  secant  equation  (3.4)  using  the  standard  BFGS  update.  Then  there 
arc  various  ways  to  incorporate  the  finite  difference  information  (3.5).  Byrd,  Schnabel,  and  Shultz  [1987] 
show  that  using  the  PSB  update  is  simply  equivalent  to  overwriting  the  corresponding  row  and  column  with 
the  finite  difference  information.  However  their  computational  results  show'  that  using  the  BFGS  update 
may  lead  to  a  slightly  more  efficient  algorithm,  and  it  has  the  advantage  of  generating  positive  definite  Hes- 


otherwise 


j = 

Bk*  1.1 

for  1  =  1,  •  •  ,q ,  where  Bk* 1.1  =  Bt*  1  and  Bk*\  -  Bk*\4*\.  This  procedure  has  the  advantage  of  simplicity, 
but  the  possible  disadvantage  that  Bk*  1  will,  in  general,  only  obey  the  last  finite  difference  equation  of  (3.5) 
exactly. 

The  second  alternative  is  to  use  multiple  secant  updates  (Schnabel  [1983]).  Let  U  e  R"**  have  as 
its  columns  u, ,  i  =  l,...,q,  and  let  Z  e  R"**  have  as  its  columns  z,,  i  =  l....,q.  If  UTZ  =  UTV2f  (Xk*\)  is 
posiuvc  definite,  we  use  the  multiple  (rank  2 q)  BFGS  update 

Bk* i  =  Bk. i  -  Bk*,Uk  (JJlBk.iUk)-'  UlBk* ,  +  Z*  {Vi Zt)~'Zi  (3.7a. 

This  update  causes  Bt*\  to  satisfy  all  q  equations  in  (3.5),  and  to  be  positive  definite  given  that  Bk*\  is  posi¬ 
uvc  definite.  If  l  *  =  V:f  (xk* \)i'k  exactly  then  the  matrix  L'iZk  is  symmetric.  However,  if  we  use  finite 
difference  approximations  for  \\  the  discretization  error  can  cause  that  matrix  to  not  be  symmetric.  There¬ 
fore,  when  using  finite  differences,  we  replace  U[\ \  with  '/:  (L’/V*  f  \'lUk)  in  (3.7a). 

If  UTZ  is  not  positive  definite,  we  use  a  sequence  of  smaller  multiple  secant  updates  to  partialis 
enforce  (3.5).  First  we  select  the  subset  PD  of  F*  consisting  of  indices  i  for  which  the  equauons  of  (3.5) 
are  consistent  with  positive  definiteness,  i.c. 

PD  =  [i  I  ie  [1,  q]  and  ujz,  >0)  . 

Then  we  use  a  heurisue  to  select  a  maximal  subset  PD\  of  PD  for  which  C'fZj  is  positive  definite,  w  here 
L' :  has  as  its  columns  u,  for  all  i  e  PD ; ,  and  Zt  has  as  its  columns  r,  for  all  iePD  Then  we  similar]) 
select  a  subset  PD 2  containing  some  or  all  of  the  remaining  members  of  PD  ,  for  which  L’JZ;  is  positive 
definite,  where  V 2  and  Z2  are  defined  similarly.  If  any  columns  remain,  we  then  select  similar  subsets 

PD  3 . PDm,  until  each  iePD  is  in  exactly  one  subset,  and  each  VjZ,  is  positive  definite.  Then  we  use 

the  muluplc  BFGS  formula  (3.7a)  to  incorporate,  in  order,  each  of  the  equations  Bt*\  V}  -  Z, ,  for  j  going 
from  m  down  to  1,  choosing  the  backward  order  so  that  the  maximal  subset  is  incorporated  last.  That  is, 
wc  perform  the  updates, 

Bt*:*-:  =  Bk*:.,  ~  Bk*:,  V,  (l'jBk*:jV,)~'  V? Bk*:u  +  Z,  {VjZ,  )*: Zj  .  1  =m  dow n  to  1  (3.7bj 
w here  Bk*..*,  =  Bk*:  anJ  Bk*:  =  In  the  computauonal  lmplementauon,  we  replace  the  criterion 
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ujzt  >0,  which  we  have  used  above  for  simplicity  of  exposition,  with  the  criterion  uj:,  > 
mackcps ,'1  I  la,  I  lj  I  Iz,  I  Ij. 

We  have  tested  algorithms  both  the  first  alternative  (3.6)  and  the  second  alternative  (3.7),  and  noticed 
a  slight  advantage  for  the  second,  multiple  secant  approach.  Therefore  only  this  approach  is  considered  in 
the  computational  results  presented  in  Section  3.3.  In  performing  the  convergence  analysis  of  Section  3.2, 
however,  it  turns  out  that  the  techniques  we  use  to  prove  the  convergence  of  the  method  using  the  multiple 
secant  approach  (3.7)  build  upon  the  convergence  analysis  of  the  sequential  update  approach  (3.6).  There¬ 
fore  in  Section  3.2  supcrlinear  convergence  of  both  of  these  methods  of  incorporating  the  partial  finite 
difference  Hessian  information  is  proved. 


3.2  Consergence  Properties  of  Partial  Hessian  Methods 


We  now  consider  the  question  of  convergence  of  the  new  methods  discussed  in  the  previous  section. 
W  e  are  able  to  show  that  Algorithm  3.1  has  the  same  properties  of  q-superlincar  convergence  and  global 
convergence  on  uniformly  convex  functions  that  the  BFGS  method  has.  In  particular  we  are  able  to  estab¬ 
lish  results  similar  to  some  of  those  of  Powell  [1976]  and  Dennis  and  Mord  [1974],  although  w-e  will  make 
use  of  machinery  for  analyzing  secant  methods  developed  by  Byrd  and  Noccdal  [1987],  The  convergence 
results  in  this  section  will  be  proved  under  the  following  assumptions. 

Assumptions  3.1. 

fl)  The  objective  function  /  has  a  Lipschitz  continuous  second  derivative  on  the  level  set 
£2  =  {x:f  (x)<f  (jc 0) } -  Denote  the  Lipschitz  constant  by  L  . 

(2>  There  arc  positive  constants  pi  and  gt;  such  that  for  all  z  e  Rn  and  all  x  e  Q 

|il  I  lz  I  I2  <  ZTV2f  (x)z  <  |i2  I  lz  H2. 

Note  that  this  implies  that /  has  a  unique  minimizcr  *•  in  Q. 

f3>  The  line  search  used  with  Algorithm  3.1  has  the  property  that  there  exist  positive  constants  r|;  and 
t);  such  that  at  each  iteration  cither 


vv 
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,r.  *  i  *  »  'T* 

/.v.v/'.v  V 


f(xt+\tdt)<f(xky  nil-^T^rr-J2. 


f  (xt+\tdk)<f  (xt  >+n 2 Vf(xt)Tdk 


is  satisfied. 


(4)  The  line  search  has  the  property  that  if  1 1  —  1 1  and  I  lx*-*.  II  are  sufficiently  small 

then  the  stcplength  X*  =  1  will  be  used. 

(5)  The  Hessian  information  used  for  the  extra  updates  is  exact.  That  is,  z,  =  V2/  (x*+i)u,  in  (3.6),  and 
Z  =  V2/(x*.,)t/  in  (3.7  a) 

The  line  search  assumption  (3;  is  meant  to  be  as  general  as  possible.  It  can  be  shown  that  it  is 
satisfied  for  some  T};,t)2  if  X*  is  chosen  by  any  standard  procedure  such  as  the  Wolfe  conditions  (3.17-18), 
the  Goldstein  conditions,  or  any  reasonable  backtracking  strategy.  This  condition  is  discussed  in  more 
detail  by  Byrd  and  Noccdal  [19S7J.  Assumption  (4)  was  shown  to  be  satisfied  by  the  Wolfe  conditions  by 
Dennis  and  More'  [1976],  and  similar  arguments  show  that  the  Goldstein  conditions  and  backtracking  also 
satisfy  Assumption  (4). 

Theorem  3.1.  Consider  Algorithm  3.1  with  the  finite  difference  updates  made  sequcnually  by  (3.6),  and 
suppose  that  Assumptions  3.1  arc  satisfied.  Then  the  sequence  [xt )  that  is  produced  converges  super- 
Iinearly  to  the  solution  x- . 

Proof.  In  the  sequential  updating  algorithm,  the  quasi-Newton  approximation,  B,  is  updated  successively 

by  BFGS  along  step  directions  st  and  finite  difference  directions  u,,i  =  1 . q .  For  the  moment  we  will 

number  the  sequence  of  quasi-Newton  matrices  over  the  entire  algorithm  in  order  of  computation  without 
regard  to  the  type  of  update  made.  Therefore  we  denote  Bt  by  n*>  and  by  £«**t)(*-i>*o-  Like¬ 
wise  we  denote  the  directions  j*  and  «,  by  a  sequence  (ry)  where  st=r(,*; ■>*  and,  at  iteration  k, 
u,  =  •  Each  update  then  has  the  form 


.  ,  _p  5 0)G G 75 to  */*/ 

To  r  + mi 


r,  TB  ij  r,  V  r, 


where  for  each  update 


*  ^  *  t 


For  the  finite  difference  based  update  G ,  =  V2/  (x*+i),  and  for  the  step  update 


Gj  =  |F2/(xi+x.r*  )skdx. 

In  either  case  by  the  uniform  convexity  in  Assumption  (3.1.2) 


_  riT  G,r,  > 


rTr 

'/  rJ 


r  rr 

Ti  ’i 


P:. 


(3.11) 


and 


T‘\G^T‘  <tn  (3.12) 

wj  G  G  G/rj 

Now  by  Theorem  2.1  of  Byrd  and  Noccdal  [1987],  if  a  sequence  of  BFGS  updates  is  performed  with  (3.1 1  j 
and  (3.12)  satisfied  for  each  update,  then  for  any  fraction  p  c  (0,1)  there  exist  constants  pi  and  P;  such  that 
for  any  positive  integer  m  the  bounds 


\r,  il  II5o>, 


5  Pi 


(3.1?. 


and 


N B0f,  'I 
I  1  I  I 


P2 


(3.14) 


are  satisfied  for  at  least  pm  values  of  j  in  [1/ri  ].  (Note  that  the  quantity  in  (3.13)  is  the  cosine  of  the  angle 
between  r)  and  B  (jfj.)  Now  if  (3.13)  is  true  for  rt  a  step  direction  then  that  implies  that  it  is  a  strong  des¬ 
cent  direction.  To  ensure  that  many  step  directions  are  strong  descent  directions  we  take  p  to  satisfy 


Then  by  the  quoted  result  on  the  BFGS,  in  k  outer  iterations  (qJ~\)k  updates  arc  made,  of  whnh 
p(^-*T)/t  >  {q+Vzjk  satisfy  (3.13)  and  (3.14).  Of  these  at  least  (q+Vi)k  updates,  at  most  qk  arc  finite 
difference  updates  so  that  at  least  Vzk  of  the  step  directions  satisfy  (3.13)  and  (3.14). 


Now  by  Theorem  3.1  of  Byrd  and  Noceda!  [1987],  if  (x* }  is  generated  by 


X*.;  =xt  w.rt  =  xk-\kBk~'^f  {Xk.) 
where,  for  each  k,  at  least  some  fixed  fraction  of  the  directions  satisfy 
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.IB* 5*  II 
iii*  ii 


P: 


and  the  line  search  satisfies  Assumption  3.1.3,  then  (x*  ]  converges  to  x •  r-Iinearly  so  that 


^llxt-x.  1 1  <  °°.  (3.15) 

To  show  that  the  convergence  is  supcrltnear,  note  that  since  the  quasi-Newton  matrix  B  is  updated 
q*  1  umes  at  each  point  (3.15)  implies  that  the  matrices  G,  in  (3.10)  satisfy 

jf  I  IG.-V2/  (x* )  1 1  <  (i?  +  l)T^max[  1 1 x*_i— x-  II,  I  lxt-x«  II]  <  =». 

Therefore  by  Theorem  3.2  of  Byrd  and  Nocedal  [1987],  or  alternatively  by  Theorem  3.4  of  Dennis  and 
More  '1°74],  it  follows  that 


ll(Bt-V2/(x.))r*  II 

nrrn 


-o. 


>F Torn  tins  fact,  superlinear  convergence  of  the  sequence  (a* )  follows  by  Theorem  2.2  of  Dennis  and 
More  :  l'>74’  and  Assumption  3.1.4.  □ 


Now  we  consider  the  multiple  secant  update  (3.7).  It  turns  out  that  doing  the  multiple  update  using  an 
r.  -q  matrix  U  is  equivalent  to  a  sequence  of  q  simple  updates  along  a  set  of  conjugate  directions  spanning 
tin  column  space  of  U . 

Lemma  3.1.  Consider  a  sequence  of  q  standard  BFGS  updates  (3.6)  to  the  positive  definite  matrix  Bk*\ 

using  directions  u  ■ . u q  that  are  conjugate  with  respect  to  V2/ (xt*i).  Suppose  that  Assumption  3.1.5  is 

vit.'f.c.l  Then  the  resulting  matrix  is  the  same  matrix  as  results  from  a  multiple  update  of  the 

-  “a  where  the  column  space  of  the  matrix  l! ,  is  equal  to  the  span  of  {u: . uq]  as  long  as  the 

metro  l  '  V-'/  is  positive  definite. 

Proof.  First  we  note  that  the  multiple  update  depends  only  on  the  column  space  of  U .  This  is  true  since  an 
x  -j  met::'  having  die  same  column  space  as  U  must  have  the  form  UT,  where  T  is  a  nonsingular  qxq 
matrix  It  we  then  rcplaee  V  by  UT  and  Z  by  ZT  in  (3.7a)  it  is  easy  to  sec  that  the  result  is  unchanged. 

Therefore  for  the  rest  of  the  proof  wc  assume  w  ithout  loss  of  generality  that  the  columns  of  U  are 
,,.f  -  .1  .1  Since  Z  =  ' .  conjugacy  with  respect  to  V2/ (x*.;>  implies  that 


fj 


'AM 


UTZ  =  diag  ( ujzt ) 

so  that  the  last  term  in  the  multiple  update  formula  (3.7a)  is 


Z(UTZ)~'ZT  = 

If  we  consider  sequential  updating  we  see  that  the  final  matrix  is 

d  _  c  Bt+ijU,ujBk+ij  tz)  r,7,T 


-  if}Bt+uu,u, 1  Bt.ij  ,  ^  r.r,1 

w<riSj,wrA"  + Sitt- 


Note  that  the  last  sum  is  equal  to  the  last  term  in  (3.7a).  Now  consider  the  BFGS  updates  one  at  a  time. 
For  a  given  i ,  if  u,  =  2,  for;  <  i  then 

n  .  .  n  .  Bt.uUiuJl,  ,  2,2,rU; 

=  £*♦1., 

since  by  conjugacy  u,T2,  =  utTzl  =  uJV2f  (xt+i)u;  =0.  Therefore,  since  each  update  causes 

£*.: u,_i  =  z.-i,  after  <7  updates  Bt+i.^+iu,  =  2,  for  i  =  1 q ,  so  that  the  q  secant  equations  are  satisfied 

just  as  for  the  multiple  update. 

Now  consider  the  matrices 


Bk+ijU.ujBk+i; 


_  A  Bk  +  ljU,  u,'  tSk 

Hi  uMklx^u, 


D2  =  Bk^U{UBk^U)-'UTBk< 


We  have  that 


DJI  =Bk.:l'  +Z(UTZ)-]ZTU  -Bk<jq*\V 
~  Bk+\U  +  Zt  ~Zk  =  Bkt\U  =  D 2U . 

Therefore  since  the  matrix  Bk^U  has  rank  q  and  D  j  and  D2  are  rank  q  matrices,  it  follows  that  both 
matrices  have  the  same  range,  the  column  space  of  Bk+\U .  Since  they  are  symmetric  they  also  have  the 
same  null  space.  Together  w  ith  the  fact  that  D \U  -  D 2U  this  implies  that  D  j  =  D2.  Therefore  by  (3.16)  it 
follows  that  the  resulting  matrices  arc  equal.  □ 


Note  that  for  this  result  we  have  not  used  any  of  the  Assumptions  3.1  except  the  last  one.  Although 
we  have  stated  this  lemma  for  a  complete  multiple  update  of  the  form  (3.7a),  it  is  dear  that  the  proof 
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applies  to  each  of  ihe  partial  multiple  updates  (3.7b)  as  long  as  each  of  the  matrices  (x*  *])£',  =  UjZ, 

in  (3.7b)  arc  positive  definite.  Thus  the  sequence  of  multiple  updates  (3.7b)  is  equivalent  to  a  sequence  of 
single  BFGS  updates  using  conjugate  directions  in  the  same  order. 

Given  the  equivalence  result  of  Lemma  3.1  the  convergence  of  the  multiple  update  version  of  our 
algorithm  follows  immediately.  Since  the  convergence  analysis  assumes  positive  definiteness  of  V2/  ( x ),  a 
complete  multiple  update  will  always  be  possible  in  Algorithm  3.1,  and  we  need  only  consider  the  form 
(3.7a). 

Theorem  3.2.  Consider  Algorithm  3.1  with  the  finite  difference  information  incorporated  using  the  multi¬ 
ple  update  (3.7a)  with  the  matrix  V  having  full  rank  at  each  iteration,  and  suppose  that  Assumptions  3.1 
are  satisfied.  Then  the  sequence  (x* )  produced  converges  superlinearly  to  the  solution  x • . 

Proof.  By  Lemma  3.1,  Algorithm  3.1  using  a  multiple  update  is  equivalent  to  Algorithm  3.1  with  a  sequen¬ 
tial  update  along  a  set  of  directions  conjugate  with  respect  V2/  (x*+1)  and  spanning  the  column  space  of  V . 
By  Theorem  3.1  that  version  of  Algorithm  generates  a  sequence  which  converges  to  x •  superlinearly.  □ 

We  have  thus  shown  that  both  versions  of  Algorithm  preserve  the  convergence  properties  of  the 
BFGS  method.  It  is  interesting  to  note  that  Theorems  3.1  and  3.2  put  absolutely  no  conditions  on  the 
choice  of  U  except  that  it  have  full  rank  at  each  step.  Of  course  this  is  true  because  we  are  only  trying  to 
show  that  the  extra  updates  do  not  interfere  with  the  good  properties  of  the  BFGS.  One  might  hope  that 
there  is  some  theoretical  result  showing  that  the  finite  difference  updates  actually  improve  the  convergence 
behavior  of  the  algorithm  in  some  wav,  but  we  have  not  been  able  to  find  one.  It  is  interesting  to  note  that 
Byrd,  Schnabel  and  Shultz  [1987]  prove  that  if  the  step  update  is  omitted  (or  removed  after  step  computa¬ 
tion)  the  resulting  algorithm  is  -step  quadratically  convergent,  and  this  result  depends  very  strongly 

on  how  the  finite  difference  directions  are  chosen.  However,  as  mentioned  in  Section  3.1  that  method  per¬ 
forms  more  poorly  in  numerical  experiments  than  the  method  analyzed  here. 
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3.3  Computational  Performance  of  Partial  Finite  Difference  Hessian  Methods 

We  have  tested  the  partial  Hessian  Algorithm  3.1  on  a  variety  of  test  problems.  Byrd,  Schnabel,  and 
Shultz  [1987]  report  the  results  of  tests  for  the  case  g  =  1  only,  on  a  set  of  problems  from  Mord,  Garbow, 
and  Hillstrom  [1981]  with  small  values  of  n  .  Here,  we  report  on  tests  of  Algorithm  3.1  for  the  full  range  q 
=  1  to  n.  When  q>  1,  we  incorporate  the  partial  finite  difference  Hessian  information  by  the  multiple 
secant  procedure  (3.7).  The  test  problems  considered  are  a  combination  of  problems  from  Mord,  Garbow , 
and  Hillstrom  [1981]  and  Conn,  Gould,  and  Toint  [1986],  run  with  the  values  n  =  20  and  n  =  40.  They  are 
listed  in  Table  A1  in  the  appendix.  The  standard  starting  point  was  used  for  all  problems  except  #15, 
where  (-0.5, 0.5,  ■  ■  ■  ,-0.5,  0.5)  was  used  because  our  BFGS  algorithm  overflowed  from  the  standard  start 
point  (-1,  1,  ■  ■  •  ,—  1 ,  1). 

The  implementation  of  Algorithm  3.1  that  we  tested  was  obtained  by  modifying  the  BFGS,  line 
search  algorithm  in  the  UNCMIN  unconstrained  optimization  software  package  (Schnabel,  Koontz.  anJ 
Weiss  [1975])  in  two  ways.  First,  at  each  iteration  the  finite  difference  information  update  (3.7)  was  added 
after  the  standard  BFGS  step  update,  as  explained  in  Section  3.1.  Second,  the  backtracking  line  search  in 
UNCMIN,  in  which  each  iterate  satisfies  the  condition 

/(•*». 0  </(**)  +  a^f{xk)Tdk  (3.17) 

for  a=10-\  w  as  augmented  so  that  each  iterate  also  satisfies 

Vf{xk.,)Tdk  >^f(xt)Tdk  (3.18.) 

where  3-0.9  (using  Algorithm  A6.3.1mod  in  Dennis  and  Schnabel  [1983]).  With  condition  (3.18),  a  posi¬ 
tive  definite  step  update  is  always  possible.  The  BFGS  algorithm  used  for  comparison  was  the  same  algo¬ 
rithm  without  any  finite  difference  information.  The  Newton’s  method  algorithm  used  for  comparison  was 
the  line  search,  Newton’s  method  algorithm  in  UNCMIN;  if  the  Hessian  is  indefinite,  it  uses  a  modified 
Cholcsky  decomposition  strategy  described  in  Dennis  and  Schnabel  [1983]  to  perturb  the  Hessian  and  cal¬ 
culate  the  line  search  direction.  The  standard  UNCMIN  stopping  conditions,  described  in  Schnabel, 
Koontz,  and  Weiss  [1985],  were  used.  The  tests  were  run  in  double  precision  on  a  VAX  780. 

We  arc  primarily  interested  in  the  performance  of  this  method  on  parallel  computers  when  function 
evaluation  is  expensive.  As  discussed  in  Section  2.1,  function  evaluation  does  not  have  to  be  vers  expen- 
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sue  before  it  swamps  al!  other  cosls  of  the  BFGS  method  on  sequential  computers.  Even  on  a  local 
memory  multiprocessor,  once  each  function  evaluation  requires  several  thousand  floaung  point  operations, 
the  cost  of  function  evaluations  is  likely  to  swamp  all  costs  including  synchromzauon  and  commumcauon. 

Thus  we  will  evaluate  Algorithm  3.1  for  each  value  of  q  by  simply  counting  the  number  of  tnal  point 
function  evaluauors  it  requires  to  solve  each  problem,  (i.e.  the  total  number  of  points  tried  in  the  line 
searches  and  all  the  iterations,  plus  the  starting  point).  If  there  are  enough  processors  to  evaluate  the  func¬ 
tion.  gradient,  and  q  columns  of  the  finite  difference  Hessian  in  one  concurrent  function  evaluation  step, 
then  the  number  of  trial  point  function  evaluations  is  equivalent  to  the  number  of  concurrent  function 
evaluauon  steps  and  is  indicative  of  the  cost  of  Algorithm  3.1  on  a  parallel  computer,  for  expensive  func¬ 
tions.  The  speed  of  Algorithm  3.1  is  then  compared  to  the  speed  of  the  parallel  BFGS  method,  imple¬ 
mented  as  discussed  in  Section  2.2.  This  parallel  BFGS  method  is  assumed  to  use  speculative  gradient 
evaluations  so  that  the  function  and  gradient  are  evaluated  in  one  concurrent  function  evaluauon  ste;  ,  but 
any  additional  processors  are  unused. 

The  raw  computational  results  for  our  method  for  various  values  of  q.  as  well  as  for  the  BFGS 
method  and  Newton’s  method,  are  given  in  Tables  A2  and  A3  in  the  appendix  for  n  ~  20  and  40  respec¬ 
tively.  This  data  is  summarised  in  Tables  3. 1-3.3.  Tables  3.1  and  3.2  give  the  simulated  average  speedup-*, 
over  the  parallel  BFGS  method,  that  we  obtained  for  each  value  of  n  and  several  values  of  q .  These  aver¬ 
age  speedups  were  computed  by  taking  all  the  problems  solved  correctly  by  both  methods  for  a  given  value 
of  q  and  n ,  and  dividing  the  total  number  of  trial  points  required  by  the  BFGS  method  on  all  these  prob¬ 
lems  by  the  total  number  of  trial  points  requtred  by  the  new  method  on  all  these  problems.  This  is  a  rea¬ 
sonable  measure  of  speedup  under  the  assumpuons  that  function  evaluation  is  expensive  and  there  are 
enough  processors  to  evaluate  the  function,  gradient,  and  q  columns  of  the  Hessian  simultaneously.  Prob¬ 
lems  not  solved  successfully  for  one  or  both  methods  arc  excluded  when  computing  the  speedups  in  Tables 
3. 1-3.3;  w  e  noticed  no  significant  difference  in  the  success  rates  of  the  various  methods. 

If  the  function  and  gradient  arc  evaluated  together,  analytically,  by  one  processor,  then  Tables  3.1- 
3.2  reflect  the  use  that  Algorithm  3.1  could  make  of  from  2  to  n+1  processors.  If  the  gradient  is  evaluated 
by  finite  differences,  then  the  "standard"  BFGS  method  that  is  used  as  the  comparison  itself  requires  n  +  1 
processors,  and  Tables  3. 1-3.2  reflect  the  use  that  Algorithm  3.1  could  make  of  from  2n-l  to  t  n  'r.  -  2'  2 
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Table  3.1  -•  Average  Speedup  of  Algorithm  3.1  over  Parallel  BFGS  Method,  n=20 


Average  Speedup 

Ratio  of  Processors  Needed 
by  Alg  3.1  vs.  BFGS,  both 
with  Analytic  Gradients 

Ratio  of  Processors  Needed 
by  Alg  3.1  vs.  BFGS,  both 
w  ith  Finite  Diff.  Gradients 


1.86  2.03  2.55  2.51  2.67  3.17 

2  3  4  5  6  11 

1.95  2.86  3.71  4.52  5.29  8.38 


Table  3.2--Average  Speedup  of  Algorithm  3.1  over  Parallel  BFGS  Method,  n=40 


Average  Speedup 
w  ithout  problem  16 

Ratio  of  Processors  Needed 
by  Alg  3.1  vs.  BFGS,  both 
with  Analytic  Gradients 

Ratio  of  Processors  Needed 
by  Alg  3.1  vs.  BFGS,  both 
with  Finite  Diff.  Gradients 


1.98  2.93  3.85  4.76  5.63 


9.66  15.88 


Table  3.3  --  Average  Speedup  of  Algorithm  3.1  over  Parallel  Newton’s  Method 
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processor*.  The  rauos  of  ihe  number  of  processors  required  by  Algorithm  3.1  to  the  number  required  by 
BFGS  are  given  in  Tables  3. 1-3.2  for  both  scenarios.  It  should  be  kept  in  mind  that  in  the  finite  difference 
case,  the  BFGS  method  which  is  the  baseline  is  already  a  parallel  algorithm  that  uses  the  speculative  gra¬ 
dient  evaluation  discussed  in  Section  2.2.  It  achieves  an  average  speedup  of  17.5  and  34.3  over  the  sequen¬ 
tial,  one  processor  BFGS  algorithm  in  the  cases  n  =  20  and  n  =  40,  respectively.  (This  indicates  that  the 
average  number  of  trial  points  evaluated  per  itcrauon  in  the  line  search  is  about  1.2.)  Thus  Algorithm  3.1 
actually  achieves  average  speedups  over  the  sequential  BFGS  method  of  32.6  to  69.5  for  q  ranging  from  1 
to  n  when  n  =  20,  and  52.8  to  83.7  (64.5  to  184.2  without  problem  16)  for  q  ranging  from  1  to  n  when  n 
=  40. 


There  are  two  important  conclusions  from  Tables  3. 1-3.2.  First,  the  new  methods  clearly  derive  a 
considerable  gain  in  speed  from  the  extra  Hessian  information  that  they  use.  Second,  this  gain  is  not  usu¬ 
ally  proportional  to  the  ratio  of  processors  (or  equivalently,  pieces  of  derivative  information  per  iteration' 
that  they  use.  However,  this  was  to  be  expected  since  we  know  that  Newton's  method,  which  uses  roughly 
n  '2  times  as  much  information  (and  n  or  n! 2  umes  as  many  processors)  as  the  BFGS  method,  is  not  usu¬ 
ally  n  2  times  as  fast  in  terms  of  the  number  of  trial  points,  or  iterations,  required.  In  fact  on  these  test  sets, 
(finite  difference)  Newton's  method  is,  on  the  average,  4.7  and  7.1  umes  as  fast  as  the  BFGS  method  in  the 


cases  n  =  20  and  n  =  40,  respectively.  The  new  method  does  a  reasonable  job  of  obtaining  an  increasing 
speedup  as  q  changes  from  1  to  n.  What  is  most  satisfying  is  that  the  speedups  are  quite  substantial  for 
small  value*  of  q  before  leveling  off;  they  arc  at  least  50FT  of  optimal  for  q  up  to  about  4. 

There  is  one  test  problem,  #16  (Variably  Dimensioned  Problem),  where  the  performance  of  Algo¬ 
rithm  3.1  is  considerably  worse  than  in  any  other  case,  especially  when  n  =  40.  This  is  die  only  problem 
where  the  performance  of  Algorithm  3.1  with  q=n  is  substantially  worse  than  Newton’s  method,  and  the 
case  ^  =  1  has  by  far  the  worst  performance  of  any  lest  problem  relative  to  the  BFGS.  We  are  continuing  to 
study  our  algorithm  to  attempt  to  understand  this  behavior  and  sec  if  it  can  be  avoided.  Since  this  one 
problem  so  strongly  influences  our  average  statistics  in  the  case  n  =  40,  Tables  3.2  and  3.3  also  show  what 
the  averages  would  be  without  problem  16. 

Table  3.3  compares  the  performance  of  Algorithm  3.1,  with  various  values  of  q ,  to  the  performance 


gradient  is  evaluated  by  finite  differences  and  that  there  are  just  enough  processors  to  evaluate  the  function, 
finite  difference  gradient,  and  q  columns  of  the  finite  difference  Hessian  simultaneously.  This  means  p  = 
(n->-l-4f ;  ((?-!).  The  parallel  finite  difference  Newton’s  method  is  assumed  to  use  the  most  efficient 

parallel  strategy.  That  is,  at  each  trial  point  it  computes  the  function,  gradient,  and  as  many  elements  of  the 
Hessian  as  the  remaining  processors  allow.  Then  if  the  trial  point  is  accepted  as  the  next  iterate,  it  uses 
- 1  concurrent  function  evaluation  steps  to  evaluate  the  remainder  of  the  Hessian,  where  = 

Thus  the  total  number  of  concurrent  function  evaluation  steps  required  by  the 

parallel  finite  difference  Newton’s  method  to  solve  a  particular  problem  is 

(;V,  x  (1  +  number  of  iterations)  +  (number  of  unsuccessful  trial  points)  ,  (3.19) 

while  for  Algorithm  3.1  it  is  the  total  number  of  trial  points  for  that  problem.  (Recall  that  the  total  number 

of  trial  points  for  a  problem  is  1  +  number  of  iterations  +  number  of  unsuccessful  trial  points.) 

For  each  value  of  q  and  n ,  the  speedup  shown  in  Table  3.3  is  the  total  number  of  concurrent  function 
evaluations  required  by  New  ton’s  method,  measured  by  (3.19),  divided  by  the  total  number  of  concurrent 
function  evaluation  steps  required  by  Algorithm  3.1,  where  the  totals  are  taken  over  all  the  problems  suc¬ 
cessfully  solved  by  both  methods.  Table  3.3  shows  that  for  all  values  of  q  <nl 2,  Algorithm  3.1  is  more 
efficient  than  a  parallel  finite  difference  Newton’s  method,  under  the  above  assumptions.  Of  problem  16  is 
included  for  n  =  40  then  the  q~ni2  case  is  slightly  worse  than  Newton’s  method  on  the  average,  but 
without  it  it  is  considerably  better.)  Thus  for  q  <  n  /2,  it  appears  to  be  better  to  evaluate  just  as  much  of  the 
finite  difference  Hessian  per  iteration  as  the  processors  allow  in  one  concurrent  function  evaluation  step, 
rather  than  using  extra  concurrent  function  evaluation  steps  to  evaluate  the  remainder  of  the  Hessian. 

Table  3.3  also  shows  that  when  q  =  n ,  Algorithm  3.1  is  slightly  inferior  to  Newton’s  method.  The 
two  methods  are  very  similar  in  this  case,  since  each  computes  the  full  finite  difference  Hessian  at  each 
iteration.  The  only  difference  is  that  Algorithm  3.1  docs  not  use  all  this  information  if  the  approximation  is 
indefinite,  while  the  finite  difference  Newton’s  method  uses  the  entire  Hessian  and  employs  the  perturbed 
Cholesky  decomposition  given  in  Gill,  Murray,  and  Wright  [1981]  to  compute  the  search  direction  when 
the  Hessian  is  indefinite.  As  might  be  expected,  often  there  is  no  difference  between  the  two  methods  but 
occasionally  discarding  some  Hessian  information  is  somewhat  detrimental  to  the  performance  of  the 
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Algorithm  3.1.  It  might  be  advantageous  to  incorporate  a  scheme  for  using  indefinite  finite  difference  Hes¬ 
sian  information  (perhaps  the  PSB  or  SRI  update)  into  Algorithm  3.1,  but  this  would  need  to  be  done  in  a 
way  that  doesn't  hurt  the  performance  of  the  method  for  small  q .  We  consider  this  a  topic  for  further 
research 


Finally,  we  also  tested  in  detail  the  version  of  Algorithm  3.1  that  uses  the  sequential  scheme  (3.6), 
rather  than  the  multiple  update  scheme  (3.7),  to  incorporate  the  finite  difference  Hessian  information.  In 
general,  the  muluplc  update  approach  required  from  5%  to  257c  fewer  iterations  and  function  evaluations 
to  solve  the  same  problems  with  the  same  value  of  q .  For  this  reason,  we  recommend  the  multiple  updat¬ 
ing  scheme. 


4.  Using  Gradient  Values  at  Unsuccessful  Trial  Points 

In  this  section  we  discuss  a  relatively  minor  improvement  that  can  be  made  to  the  parallel  BFGS 
algorithm  discussed  in  Section  2  as  well  as  to  some  sequential  BFGS  algorithms.  It  is  to  use  the  gradient 
values  that  arc  computed  at  the  unsuccessful  trial  points  in  the  line  search  to  reduce  the  total  amount  of 
work  required  to  solve  the  optimization  problem.  If  the  gradient  is  evaluated  by  finite  differences,  then  we 
assume  that  we  are  using  the  parallel  BFGS  algorithm  of  Section  2  with  p>n  + 1  so  that  the  entire  finite 
difference  gradient  is  evaluated  at  each  trial  point  If  the  analytic  gradient  is  a  by-product  of  the  function 
evaluation  on  one  processor,  then  the  information  wc  consider  is  available  in  a  standard  sequential  BFGS 
method.  It  is  also  available  in  any  sequential  unconstrained  optimization  code  that  requires  the  gradient  to 
be  returned  along  with  the  function  value;  some  unconstrained  optimization  software  packages,  for  exam¬ 
ple  CONM1N  (Shanno  and  Phua  [1978])  and  MINOS  (Murtaugh  and  Saunders  [1983])  are  organized  in 
this  way.  The  strategies  discussed  in  this  section  arc  applicable  to  all  these  situations. 


In  all  the  above  cases,  even  though  the  gradient  is  available  at  unsuccessful  trial  points,  it  is  only 
used  in  the  line  search.  Schnabel  [1987]  proposed  several  further  uses  that  might  be  made  of  this  informa¬ 
tion.  Here  w'e  pursue  the  suggestion  from  Schnabel  [1987]  that  we  consider  most  promising.  To  facilitate 
our  discussion,  let  us  use  the  simplified  notation  that  the  current  iterate  is  xc ,  the  current  gradient  is  gc ,  the 
current  Hessian  approximation  is  Bc ,  the  current  search  direction  is  dc  =  -Bj'  dc ,  the  current  trial  point  is 


x,  =  x.  -  ‘tAc ,  and  the  gradient  at  x,  is  g, .  We  assume  that  x,  is  an  unsatisfactory  choice  for  the  next  iterate 

We  v.  ill  attempt  to  use  the  gradient  at  the  unsuccessful  trial  point  x ,  to  immediately  update  the  Hes¬ 
sian  approximation  and  compute  a  new  search  direcuon,  even  though  we  have  not  successfully  concluded 
the  current  line  search.  To  motivate  this  strategy,  consider  the  case  when  /  is  a  positive  definite  quadratic. 
It  is  still  possible  that  x,  is  unsatisfactory  because  the  Hessian  approximation  Bc  is  inadequate.  The  stan¬ 
dard  BFGS  algorithm  would  continue  the  line  search  until  it  calculates  a  satisfactory  next  iterate  x*  = 
x- »Xcf- ,  where  >.  =  cX  for  some  o*l.  Let  g,  be  the  gradient  at  x+,  and  let  B«.  be  the  BFGS  update  to  Bc 
using  the  step  from  xc  to  x„.  Also  consider  the  matrix  B,  that  would  bc  generated  as  the  BFGS  update  to 
B  using  the  step  from  x,:  tor,. 

The  first  key  point  is  that  B»  =  B, ,  that  is  the  updates  obtained  from  using  x„  and  g„.  or  using  x,  and 
£:,  are  die  same.  Tins  ts  because  any  two  points  along  a  line  will  generate  the  same  secant  equation,  and 
hence  the  same  update,  for  a  quadratic  function.  (Algebraically,  x.-xc  =  c(x,-x£)  and  g.~gL-  = 
o 1  g.  -g,  '.  i  The  other  key  point  is  that  since 

x.-B:'  g.  =  Xc  -  Be1  gc  =  xc  -Bf1  8c  =  X,  -B,-1  g, 
w  uh  die  first  and  durd  equalities  coming  from  the  secant  equation  and  the  second  from  B ,  =  B, ,  we  do  not 

ha%e  to  compute  x.,  or  adopt  x,  as  the  new  current  iterate,  to  undertake  the  next  iteration.  Rather  we  can 

replace  B.  with  B,  and  conunuc  iterating  from  xc,  in  the  new  direction  -B,"1  gc.  If  a  sicplcngth  of  one  is 

used  at  the  new  iteration,  then  the  same  point  x++  will  be  generated  as  if  we  had  iterated  from  either  x  + 

using  the  direction  -BC1  g+,  or  from  x,  using  the  direction  -B,~‘  gi  ■ 


For  quadratic  / ,  this  strategy  allows  a  BFGS  algorithm  to  use  only  one  trial  point  per  update,  while 
likely  requiring  no  more  iterations  than  the  standard  BFGS  method.  If  one  is  using  the  standard  sequential 
BFGS  method,  Algorithm  2.1,  and  the  gradient  is  being  evaluated  by  finite  differences,  then  the  saving  is 
small  because  the  number  of  function  evaluations  per  iteration  is  simply  reduced  from  a  maximum  of  n+2 
to  n  -1.  If,  however,  one  is  in  any  of  the  parallel  or  sequential  scenarios  mentioned  at  the  start  of  this  sec- 
uon,  where  the  gradient  is  computed  along  with  the  function  value  at  each  trial  point,  then  this  strategy  has 
the  potential  to  cut  the  cost  of  some  iterations  in  half  (from  two  function-gradient  pairs  to  one)  which  is  a 
more  significant  savings.  The  strategy  also  has  the  appealing  property  that  it  never  selects  an  unsatisfac- 


L-V 


information  v.e  would  have  gotten  at  x+,  and  then  continues  iterating  from  xc  which  is  the  best  point  we 
have  so  fur 


When  /  is  not  quadratic,  B,  will  not  generally  equal  B and  the  strategy  of  updating  Bc  to  B,  and 
replacing  the  search  direction  from  xc ,  -Bc~]  gc ,  with  -Bf 1  gc  may  not  be  a  good  one.  Ideally,  replacing 
B  w ith  B.  would  seem  to  be  a  good  idea  if  B,  is  closer  to  V2/ (xc)  than  Bc  is  in  the  direction  dc ,  i.e.  if 

W(V:f(xc)-B,)dc  II  <  ll(V2/(xc)-Bc)d£  II  .  (4.1) 

Since  we  don’t  know  V2/ (jrc ),  however,  we  try  to  determine  whether  B,  is  a  better  approximation  to 

V2/  (or  )  than  B:  is  by  seeing  whether  the  quadratic  model  around  xc  using  B,  predicts  /  (*,)  better  than  the 

quadratic  model  around  xc  using  Bc  does,  i.e.  if 

i /  (x.  >  -  gj d.  +  djB,  dc  -  /  (x, )  I  <  1/ (x, )  *  gj d,  +  dj Bc  d:  -  f  (x,)\  .  (4.2 1 

i Note  that  it  is  not  necessary  to  form  B.  to  check  (4.2)  since  we  know  B,  dc  =  g,-ge.)  If  (4.2)  is  satisfied, 

then  it  seems  advantageous  to  calculate  B,  and  d,  =  -Br!  gc  and  change  the  line  search  direction  from  xc 
to  d;.  Otherwise  it  seems  better  to  continue  the  line  search  from  xc  in  the  direction  d{  in  the  normal 
fashion. 

We  have  tested  this  approach  on  the  same  problem  set  as  was  used  in  Section  3.3  (sec  Appendix  Al). 
We  compared  the  normal  BFGS  algorithm,  Alg.  2.1,  to  an  algorithm  that  differs  in  that  it  updates  the  Hes¬ 
sian  approximauon  and  sw  itches  line  search  directions  as  described  above  if  the  line  search  finds  an  unsa¬ 
tisfactory  point  x:  which  fails  (3.17)  and  satisfies  (4.2).  (In  addition  x,  must  satisfy  {g,~gc)T dc  >  0  to 
assure  that  the  update  will  retain  positive  definiteness.)  In  this  case  the  new  strategy  makes  one  other 
alteration  so  that  a  satisfactory'  next  iterate  will  eventually  be  found  :  rather  than  starting  the  line  search  in 
the  new  direction  d:  w  ith  a  steplength  of  one,  it  chooses  the  initial  stcplcngth  in  the  new  direction  so  that 
the  length  of  the  next  trial  step  is  the  same  as  if  the  line  search  had  been  continued  in  the  old  direction  dc . 
That  is,  if  the  next  stcplcngth  in  the  direction  dc  would  have  been  X,  it  chooses  the  initial  steplength  X  in 
the  direction  d,  to  bc  X  I \dc  1 1  /  I \d,  1 1 .  If  the  next  iterate  again  fails  (3.17)  but  satisfies  (4.2)  and  the 
positive  curvature  condiuon,  then  the  Hessian  and  search  direction  is  changed  again  with  the  steplength 
again  being  reduced  by  this  mechanism;  otherwise  the  line  search  is  continued  with  the  new  line  search 
direction.  This  approach  is  continued  until  a  satisfactory  next  iterate  is  found  As  soon  as  a  satisfactory 
next  iterate  is  found,  the  next  line  search  starts  with  steplength  one,  as  usual. 


On  our  lest  set,  we  found  that  this  strategy  reduced  the  average  number  of  trial  point  function  evalua¬ 
tions  needed  to  solve  the  problems  by  about  37<  in  the  case  n  =  20,  and  by  about  127c  in  the  case  n  =  40. 
This  reduction  is  indicative  of  the  reduction  in  computational  cost  in  any  situation  where  the  gradient  is 
computed  at  each  trail  point,  and  function  evaluation  is  the  overriding  cost.  Recall  from  Section  3.3.  that 
for  these  test  problems,  only  about  207c  of  the  total  iterations  have  unsuccessful  trial  points,  so  that  at  least 
for  n  =  40  the  observed  savings  arc  fairly  satisfactory.  If  we  use  the  ideal  (and  impossible  in  practice)  test 
(4.1)  instead  of  (4,2)  to  decide  when  to  use  our  new1  strategy,  the  average  saving  rises  to  157c  in  the  case  n 
=  20  but  drops  to  Sfc  in  the  case  n  -  40,  This  indicates  that  there  may  be  room  for  improvement  in  our 
results  if  we  can  find  a  better  heuristic  than  (4.2)  to  decide  when  to  invoke  our  strategy  of  switching  line 
search  directions. 

While  these  savings  are  not  dramatic,  they  point  to  a  small  improvement  that  can  be  made  to  die 
BFGS  algorithm  whenever  the  gradient  is  available  at  each  trial  point,  i.c.  both  in  the  parallel  BFGS 
method  discussed  in  Section  2  and  in  some  sequential  BFGS  codes.  It  also  leads  us  to  be  interested  in 
some  related  ideas  that  we  mention  in  the  next  section. 

5.  Summary  and  Directions  for  Future  Research 

In  Section  2  we  have  shown  that  it  is  fairly  easy  to  efficiently  utilize  up  to  ni-1  processors  in  the 
standard  BFGS  algorithm  for  unconstrained  optimization,  in  two  different  situations.  First,  if  function 
evaluation  is  expensive  and  gradients  are  evaluated  by  finite  differences,  then  by  evaluating  the  gradirm 
along  w  ith  the  function  at  every  trial  point  one  can  generally  realize  at  least  70-807c  efficiencies  with  up  to 
n  + 1  processors.  Secondly,  if  n  is  large  enough  that  the  linear  algebra  costs  of  the  method  are  significant, 
then  it  is  also  fairly  easy  to  parallelize  the  Linear  algebra  efficiently  for  up  to  n  processors.  This  causes  us 
to  reexamine  the  various  implementations  of  the  BFGS  update,  and  to  chose  the  unfactored  update  of  the 
inverse  matrix  which  appears  to  be  the  cheapest  sequential  and  parallel  approach  and  to  have  no  noticeable 
finite  precision  difficulties  in  comparison  to  the  other  possible  approaches. 

Several  variations  on  the  approaches  of  Section  2  merit  investigation.  One  is  whether  it  would  be 
better  to  use  extra  processors  to  evaluate  multiple  points  simultaneously  during  the  line  search,  as  proposed 
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by  several  authors  including  Dixon  [1981],  Dixon  and  Patel  [1982],  Patel  [1982],  Lootsma  [1984],  and  van 
Laarhoven  [1985J,  rather  than  performing  speculative  finite  difference  gradient  evaluations.  A  second 
issue  is  whether  the  modification  of  the  BFGS  method  recently  proposed  by  Powell  [1987]  has  a  significant 
advantage  in  practice,  and  if  so,  how  it  (different)  linear  algebra  is  best  implemented  on  a  parallel  com¬ 
puter. 

In  Section  3  we  have  examined  the  situation  when  function  evaluation  is  expens  ve  and  there  are 
more  processors  than  are  needed  to  evaluate  the  function  and  gradient  simultaneously.  This  occurs  if  the 
gradient  is  evaluated  by  finite  differences  and  the  number  of  processors,  p ,  is  greater  than  n+ 1,  or  if  the 
gradient  is  evaluated  analytically  along  with  the  function  on  one  processor  and  p>  1.  If  there  are  enough 
processors  so  that  we  can  evaluate  the  function,  gradient,  and  (finite  difference)  Hessian  concurrently,  then 
we  would  use  a  parallel  implementation  of  Newton’s  method.  If  not,  then  we  have  proposed  using  new 
optimization  algorithms  that  use  the  function,  gradient,  and  q  <n  columns  of  the  finite  difference  Hessian  at 
each  iteration.  These  can  be  thought  of  as  falling  in  between  the  BFGS  method  and  Newton’s  method.  Wc 
have  shown  that  the  performance  of  these  methods  for  different  values  of  q  varies  between  the  perfor¬ 
mance  of  the  BFGS  method  and  Newton’s  method  as  might  be  expected.  We  have  also  shown  that  if  there 
are  just  enough  processors  to  evaluate  the  function,  gradient,  and  q  columns  of  the  finite  difference  Hes¬ 
sian,  and  our  new  method  is  more  efficient  than  cither  the  parallel  BFGS  method  or  a  parallel  implementa¬ 
tion  of  the  finite  difference  Newton’s  method. 

There  arc  several  interesting  research  questions  regarding  these  new  methods  that  use  part  of  the 
Hessian  at  each  iteration.  One  is  whether  it  would  be  better  to  use  an  update  that  allows  indefiniteness, 
such  as  the  SRI,  to  incorporate  the  finite  difference  information,  rather  than  using  the  BFGS  as  wc  have 
done.  This  question  is  especially  intriguing  in  light  of  the  recent  results  of  Conn,  Gould,  and  Toint  [1986] 
that  report  very  good  computational  performance  for  a  trust  region  method  using  the  SRI.  A  second  issue 
is  whether  some  different  procedure  for  choosing  the  finite  difference  directions  (u, )  that  arc  used  at  each 
iteration  would  be  preferable.  An  example  would  be  choosing  these  directions  based  upon  the  recent  step 
directions.  Another  more  general  issue  is  whether  there  are  better  ways  to  utilize  additional  processors 
than  evaluating  part  of  the  Hessian,  for  example  using  the  extra  evaluations  to  form  a  higher  order  model 
such  as  the  tensor  model  introduced  by  Schnabel  and  Frank  [1984], 


In  Section  4  we  have  considered  the  use  of  gradient  information  at  unsuccessful  trial  points  in  die 
line  search.  Such  information  is  available  in  our  parallel  methods  that  evaluate  funcuon  and  gradient  infor¬ 


mation  simultaneously,  and  also  in  several  well-known  sequenual  unconstrained  optimization  packages 
We  have  shown  that  we  can  modify  the  BFGS  algorithm  to  utilize  this  information  in  a  way  that  leads  to 
small  gains  in  efficiency. 

This  work  on  using  derivative  information  from  unsuccessful  trial  points  might  be  extended  in  a 
number  of  directions.  In  the  parallel  methods  that  use  partial  Hessian  information,  one  could  also  consider 
whether  the  Hessian  information  from  unsuccessful  trial  points  could  be  utilized.  In  this  connection  one 
might  want  to  consider  evaluating  the  partial  Hessian  information  at  the  current  iterate  xc  rather  than  at  die 
trial  point  x, .  We  also  have  not  yet  considered  the  case  when  p<n  where  only  pan,  rather  than  all,  ot  the 
finite  difference  gradient  is  evaluated  at  the  unsuccessful  trial  point. 
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Table  Al  --  Test  Problem  Set 


Problem 

Name 


Trigonometric 


Extended  Rosenbrock 


Extended  Powell  Singular 

Chebyquad 

Chained  Singular 


Generalized  Wood 


Chained  Wood 


Source  of 
Problem 


MGH26 


MGH21 


MGH22 


MGH35 


Generalized  Broyden  Tridiagonal  (,a>  CGT10 
Generalized  Broyden  Tridiagonal  (b  1  CGTl  1 


Generalized  Broyden  Banded  (a) 
Generalized  Broyden  Banded  (bi 


Toint-Brovden  7  Diagonal 


Toint  Trigonometric 
Generalized  Cragg  and  Levy 


Generalized  Brown 


Variable  Dimensioned 


Penalty  Function  1 
Penally  Function  II 


CGT  =  Conn,  Gould,  Toint  [1986] 

MGH  =  More,  Garbow,  Hillstrom  [1981] 


CGTl  3 


CGT14 


CGT  16 


CGT17 


CGT21 


MGH25 


MGH23 


MC.H24 


WWW 


Table  A2  -■  Test  Results,  n=20 


Iterations 

Unsuccessful  Tnal  Points _ 

BFGS  Algorithm  3.1  Newton' 


<7  =  1 

- 

<7=3 

<7=4 

<7=5 

£7  =  10 

o, 

11 

Method 

1 

47 

28 

27 

23 

23 

16 

12 

12 

12 

6 

5 

9 

13 

15 

12 

4 

6 

6 

-1 

46 

94 

80 

84 

72 

68 

45 

24 

24 

21 

19 

26 

38 

55 

44 

16 

8 

8 

3 

48 

75 

47 

41 

36 

26 

24 

15 

15 

19 

1 

2 

4 

6 

0 

4 

0 

0 

4 

54 

49 

47 

-- 

32 

34 

— 

-- 

.. 

15 

8 

13 

— 

12 

19 

— 

-- 

.. 

5 

308 

57 

41 

37 

31 

29 

27 

20 

20 

21 

4 

T 

i- 

3 

1 

0 

0 

0 

0 

6 

164 

133 

134 

103 

107 

101 

86 

54 

51 

32 

23 

56 

48 

49 

77 

50 

38 

28 

271 

50 

118 

66 

58 

51 

69 

48 

49 

33 

10 

45 

37 

21 

14 

59 

27 

27 

8 

56 

34 

24 

20 

16 

16 

14 

10 

10 

4 

0 

0 

0 

0 

0 

0 

0 

r 

9 

21 

20 

14 

13 

11 

10 

7 

5 

5 

3 

0 

0 

0 

0 

0 

0 

0 

0 

10 

125 

32 

25 

20 

18 

17 

15 

12 

12 

6 

0 

0 

0 

0 

1 

0 

0 

0 

1 1 

107 

22 

15 

13 

12 

10 

9 

7 
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Table  A3  —  Test  Results,  n=40 


Problem 

Iterations 

Number 

Unsuccessful  Trial  Points 

BFGS 

Algorithm  3.1 

<7  =  1  <7=2  <7 

l/-'. 

II 

1 

II 

<7  =  10 

Newton’s 

Method 


8 

17 

17 

15 

-- 

3 

4 

18 

29 

-- 

7 

83 

46 

24 

24 

9 

70 

30 

8 

8 

5 

25 

24 

15 

15 

iteration  limn  (500),  --  =  overflow 
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